1 Seoul Bike Sharing Business Story

Seoul Bike Sharing

Biking has become a modern, eco-friendly, and comfortable mode of transport that many people now use for their daily commute. Bike-sharing has been widely promoted in metropolitan cities because it is not just a tool for hopping between places; it also offers a leisurely, relaxing travel experience.

Seoul, the capital of South Korea, like many other major cities, has its own bike-sharing system. The system provides its service via an app that allows registered users to access a city-wide bike fleet. Moreover, users can rent a bike at one station and return it at any other station in the same network.

The Seoul bike-sharing system was launched in 2017, and its user base has grown rapidly within a short time. In the bike-rental business, it is essential to have enough bikes to meet customer demand. The availability and accessibility of bikes therefore raise two questions: how can a bike-rental business maintain a stable supply to meet riders’ demand, and how can it use demand information to set pricing strategies (for example, full price at peak times and discounted prices during off-peak periods)?

To answer these questions, a prediction of bike demand would help the business optimize its resources and operations and provide a stable supply to riders. This project uses various machine learning models to predict the demand for rental bikes. In addition, an hourly prediction of bike usage can be used to adjust rental prices according to demand, time, and calendar events.

image source: https://shorturl.at/btIOP

2 Data Processing and Understanding

2.1 Process the Dataset

The dataset was collected from Kaggle (source: https://www.kaggle.com/saurabhshahane/seoul-bike-sharing-demand-prediction). The original data set contains 8,760 rows × 14 columns.

bikes <- read.csv("SeoulBikeData_Original.csv", header = TRUE, 
                  fileEncoding = "Latin1", check.names = F)
head(bikes)

The data variables, their types, and descriptions are shown in the following table; the variables meet our research requirements.

Data variables and description:

Variable | Type | Description
Date | Date (DD/MM/YYYY) | The date of the record (Dec 2017 – Nov 2018)
Rented Bike Count | Discrete | Number of bikes rented during the hour
Hour | Categorical | Hour of the day (0–23)
Temperature(°C) | Continuous | Ambient temperature in °C
Humidity(%) | Continuous | Relative humidity in %
Wind speed (m/s) | Continuous | Wind speed in m/s
Visibility (10m) | Continuous | Visibility distance, in units of 10 m
Dew point temperature(°C) | Continuous | Temperature to which the air would have to cool to reach saturation
Solar Radiation (MJ/m2) | Continuous | Solar radiation in MJ/m²
Rainfall(mm) | Continuous | Precipitation in mm
Snowfall (cm) | Continuous | Snowfall in cm
Seasons | Categorical | Spring, Summer, Autumn, Winter
Holiday | Categorical | Holiday, No Holiday (workday)
Functioning Day | Categorical | Whether the rental system operated that day (Yes/No)

All data were first inspected for NaN values and duplicated rows. The ‘Date’ column was split into ‘Year’, ‘Month’, ‘Day’, and ‘Weekday’ for analysis. After processing, the data set contains 8,760 rows × 21 columns.

The weekday variable is coded as follows:
Day_of_Week: {‘Sunday’: 1, ‘Monday’: 2, ‘Tuesday’: 3, ‘Wednesday’: 4, ‘Thursday’: 5, ‘Friday’: 6, ‘Saturday’: 7}

First, we checked whether there were any duplicated rows or missing values in the data set.

anyDuplicated(bikes)
## [1] 0
sum(is.na(bikes))
## [1] 0

Next, we split the date into year, month, day, and day of the week.

library(lubridate)

# Helper that formats an hour as "H:00" (the hour1 column is dropped again below)
Hourr <- function(x) {
  paste(x, ":00", sep = "")
}
bikes$hour1 <- unlist(lapply(bikes$Hour, Hourr))
# Parse the date (day/month/year) and derive calendar fields
bikes$Date <- strptime(as.character(bikes$Date), format = "%d/%m/%Y")
bikes$year <- year(bikes$Date)
bikes$month <- month(bikes$Date)
bikes$day <- day(bikes$Date)
bikes$wday <- wday(bikes$Date)
# Keep only the rows where the rental system was operating
bikes$fday <- bikes$'Functioning Day'
bikes$fday[which(bikes$fday == "Yes")] <- 1
bikes$fday[which(bikes$fday == "No")] <- 0
bikes <- subset(bikes, fday == 1)
# Drop helper columns that are no longer needed
bikes$'Functioning Day' <- NULL
bikes$fday <- NULL
bikes$hour1 <- NULL

After transforming the date and time fields, we keep only the rows where ‘Functioning Day’ is “Yes”, since “No” means the service was not operating; all rows where ‘Functioning Day’ equals 0 (did not operate) were deleted. Finally, because every remaining row has functioning day = 1, this column is also dropped to simplify the data set.

names(bikes)[2] <- "bike_count"
names(bikes)[3] <- "hour"
names(bikes)[4] <- "temperature"
names(bikes)[5] <- "humidity"
names(bikes)[6] <- "wind_speed"
names(bikes)[7] <- "visibility"
names(bikes)[8] <- "dew"
names(bikes)[9] <- "solar"
names(bikes)[10] <- "rainfall"
names(bikes)[11] <- "snowfall"
names(bikes)[12] <- "season"
names(bikes)[13] <- "holiday"

bikes$season <- as.factor(bikes$season)
bikes$year <- as.factor(bikes$year)
bikes$month <- as.factor(bikes$month)
bikes$day <- as.factor(bikes$day)
bikes$wday <- as.factor(bikes$wday)
bikes$hour <- as.factor(bikes$hour)
bikes$holiday <- as.factor(bikes$holiday)

We renamed the columns to simpler terms for coding purposes and converted the categorical variables to factors.

bikes$highdemand <- bikes$bike_count
bikes$highdemand[bikes$bike_count >= 500] <- 1
bikes$highdemand[bikes$bike_count < 500] <- 0
bikes$highdemand <- as.factor(bikes$highdemand)
bikes <- bikes[c(1,14,15,16,17,3,2,4,5,6,7,8,9,10,11,12,13,18)]
str(bikes)
## 'data.frame':    8465 obs. of  18 variables:
##  $ Date       : POSIXlt, format: "2017-12-01" "2017-12-01" ...
##  $ year       : Factor w/ 2 levels "2017","2018": 1 1 1 1 1 1 1 1 1 1 ...
##  $ month      : Factor w/ 12 levels "1","2","3","4",..: 12 12 12 12 12 12 12 12 12 12 ...
##  $ day        : Factor w/ 31 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ wday       : Factor w/ 7 levels "1","2","3","4",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ hour       : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ bike_count : int  254 204 173 107 78 100 181 460 930 490 ...
##  $ temperature: num  -5.2 -5.5 -6 -6.2 -6 -6.4 -6.6 -7.4 -7.6 -6.5 ...
##  $ humidity   : int  37 38 39 40 36 37 35 38 37 27 ...
##  $ wind_speed : num  2.2 0.8 1 0.9 2.3 1.5 1.3 0.9 1.1 0.5 ...
##  $ visibility : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 1928 ...
##  $ dew        : num  -17.6 -17.6 -17.7 -17.6 -18.6 -18.7 -19.5 -19.3 -19.8 -22.4 ...
##  $ solar      : num  0 0 0 0 0 0 0 0 0.01 0.23 ...
##  $ rainfall   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ snowfall   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ season     : Factor w/ 4 levels "Autumn","Spring",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ holiday    : Factor w/ 2 levels "Holiday","No Holiday": 2 2 2 2 2 2 2 2 2 2 ...
##  $ highdemand : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 2 1 ...

Using 500 as the threshold for high demand, hours with 500 or more rentals are marked ‘1’ (high demand) and hours with fewer than 500 are marked ‘0’ (not high demand). After reordering the columns, the data set was ready for the modeling steps.

library(caret)
library(dplyr)
set.seed(1)
indices <- createDataPartition(bikes$hour, p = 0.8, list = FALSE)
train <- bikes %>% slice(indices)
test <- bikes %>% slice(-indices)
#write.csv(train, file="train.csv", row.names = F)
#write.csv(test, file="test.csv", row.names = F)

Last, we randomly split the data set, stratified by hour, into 80% for training models and 20% for testing them. (The subsets were exported as CSV files so that team members could fit models on their own devices.)

2.2 Graphic Analysis

2.2.1 Graphical Analysis - Continuous Variables

The graphics below show how the data are spread across the continuous variables.

2.2.2 Graphical Analysis - Categorical Variables

The plots below show the distribution of bike counts across the categorical variables.

According to the graphs, the predictors wday, day, and year do not seem very important. From the boxplot of the year variable, we can see that the data do not cover all of 2017 and 2018. The predictors season, month, hour, and holiday appear to be important.

2.2.3 Graphical Analysis - Binomial “high-demand”

If the bike count is 500 or above, the hour is categorized as high demand (1); below 500, it is low demand (0). The graphs below show how the high-demand indicator is distributed across the different variables.

High-demand hours appear to differ from low-demand hours across most variables in the data set.

3 Linear Model

From a business perspective, it is interesting to know which factors affect the count of total rental bikes. However, since a linear model assumes a continuous response, the bike count, being a count variable, cannot be used directly as the response here. Therefore, we transform the count data with the square-root function and fit a linear model.
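The square root is a standard variance-stabilizing transform for count data: for Poisson-like counts, the variance of √X is roughly 1/4 regardless of the mean, so busy and quiet hours contribute more evenly to the fit. A quick simulated check (illustrative only, not the report’s data):

```r
set.seed(1)
low  <- rpois(1e5, lambda = 10)    # quiet hours
high <- rpois(1e5, lambda = 900)   # busy hours
c(var(low), var(high))             # raw variance grows with the mean
c(var(sqrt(low)), var(sqrt(high))) # both are close to 0.25 after the transform
```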

3.1 Graphic Analysis

3.1.1 Effect of Various Factors on Bike Counts

The relationship between bike counts and temperature looks non-linear: the count increases with temperature up to about 30 °C and decreases beyond that point. Humidity, wind speed, and visibility show slight linear relationships, but the effects are not pronounced.

For dew point, solar radiation, rainfall, and snowfall, the linear relationships also remain fairly weak.

3.1.2 Interaction Examination

Season was used instead of month in the interaction plots to give a clearer picture; month would be used as the model variable if an interaction exists.

The lines in the interaction plots of season/holiday and holiday/hour stay roughly parallel, so there is no significant interaction between those factors. However, the lines for weekday and month intersect, suggesting possible weekday:month and temperature:holiday interaction effects on bike counts. We consider these effects in the linear and GAM models.

3.2 Modeling

3.2.1 Fitting the model with ALL variables

We start by fitting a model containing all variables and use the drop1() function to identify variables that can be removed. As mentioned above, the bike counts are square-root transformed in order to fit a linear model.

lm.bikes.0 <- lm(sqrt(bike_count) ~ temperature + humidity + wind_speed + visibility + dew 
                 + solar + rainfall + snowfall + month + holiday + hour + wday, data = train)
drop1(lm.bikes.0, test = "F")
## Single term deletions
## 
## Model:
## sqrt(bike_count) ~ temperature + humidity + wind_speed + visibility + 
##     dew + solar + rainfall + snowfall + month + holiday + hour + 
##     wday
##             Df Sum of Sq    RSS   AIC  F value    Pr(>F)    
## <none>                   224483 23841                       
## temperature  1        86 224569 23842   2.5830    0.1081    
## humidity     1      7854 232337 24073 235.6326 < 2.2e-16 ***
## wind_speed   1        71 224554 23841   2.1217    0.1453    
## visibility   1       638 225121 23859  19.1422 1.232e-05 ***
## dew          1      1877 226360 23896  56.3202 6.951e-14 ***
## solar        1      1040 225523 23871  31.2095 2.407e-08 ***
## rainfall     1     20033 244516 24419 601.0201 < 2.2e-16 ***
## snowfall     1         0 224484 23839   0.0105    0.9186    
## month       11     59601 284084 25417 162.5598 < 2.2e-16 ***
## holiday      1      3581 228065 23947 107.4518 < 2.2e-16 ***
## hour        23    171322 395805 27643 223.4797 < 2.2e-16 ***
## wday         6      5724 230207 24000  28.6225 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Judging from the p-values, temperature, wind speed, and snowfall do not have a significant influence on bike count, so the model below removes these predictors (month is also omitted in this specification).

lm.bikes.1 <- lm(sqrt(bike_count) ~ humidity + visibility + dew + solar 
                 + rainfall + holiday + hour + wday, data = train)
drop1(lm.bikes.1, test = "F")
## Single term deletions
## 
## Model:
## sqrt(bike_count) ~ humidity + visibility + dew + solar + rainfall + 
##     holiday + hour + wday
##            Df Sum of Sq    RSS   AIC   F value    Pr(>F)    
## <none>                  285765 25451                        
## humidity    1     55747 341512 26658 1316.5820 < 2.2e-16 ***
## visibility  1       250 286016 25455    5.9152   0.01504 *  
## dew         1    199817 485582 29046 4719.1240 < 2.2e-16 ***
## solar       1      2304 288069 25504   54.4183 1.814e-13 ***
## rainfall    1     20299 306065 25915  479.4118 < 2.2e-16 ***
## holiday     1      4415 290180 25553  104.2663 < 2.2e-16 ***
## hour       23    183778 469543 28774  188.7099 < 2.2e-16 ***
## wday        6      5376 291141 25565   21.1592 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Under drop1(), every variable is now significant. We next add the interaction terms to the model and inspect it further.

lm.bikes.2 <- lm(sqrt(bike_count) ~ humidity + visibility + dew + solar + rainfall + holiday 
                 + hour + wday + temperature:holiday + wday:month, data = train)
drop1(lm.bikes.2, test = "F")
## Single term deletions
## 
## Model:
## sqrt(bike_count) ~ humidity + visibility + dew + solar + rainfall + 
##     holiday + hour + wday + temperature:holiday + wday:month
##                     Df Sum of Sq    RSS   AIC F value    Pr(>F)    
## <none>                           210544 23536                      
## humidity             1      6176 216721 23730 195.666 < 2.2e-16 ***
## visibility           1      1689 212234 23589  53.518 2.863e-13 ***
## dew                  1      1256 211801 23575  39.799 2.995e-10 ***
## solar                1       495 211039 23550  15.681 7.575e-05 ***
## rainfall             1     16563 227107 24048 524.699 < 2.2e-16 ***
## hour                23    172403 382948 27549 237.465 < 2.2e-16 ***
## holiday:temperature  2       772 211316 23557  12.228 5.004e-06 ***
## wday:month          77     73172 283716 25406  30.105 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Again, every variable appears significant.

summary(lm.bikes.0)$r.squared
## [1] 0.7638281
summary(lm.bikes.1)$r.squared
## [1] 0.6993551
summary(lm.bikes.2)$r.squared
## [1] 0.7784927
anova(lm.bikes.0, lm.bikes.1)
## Analysis of Variance Table
## 
## Model 1: sqrt(bike_count) ~ temperature + humidity + wind_speed + visibility + 
##     dew + solar + rainfall + snowfall + month + holiday + hour + 
##     wday
## Model 2: sqrt(bike_count) ~ humidity + visibility + dew + solar + rainfall + 
##     holiday + hour + wday
##   Res.Df    RSS  Df Sum of Sq      F    Pr(>F)    
## 1   6735 224483                                   
## 2   6749 285765 -14    -61282 131.33 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(lm.bikes.0, lm.bikes.2)
## Analysis of Variance Table
## 
## Model 1: sqrt(bike_count) ~ temperature + humidity + wind_speed + visibility + 
##     dew + solar + rainfall + snowfall + month + holiday + hour + 
##     wday
## Model 2: sqrt(bike_count) ~ humidity + visibility + dew + solar + rainfall + 
##     holiday + hour + wday + temperature:holiday + wday:month
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1   6735 224483                                  
## 2   6670 210544 65     13939 6.7935 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
linear_model <- lm.bikes.2

Model 2 has the highest R-squared, followed by model 0 and model 1. The ANOVA tests show that both model 1 and model 2 differ significantly from model 0. Considering the R-squared values, we choose lm.bikes.2 for the model evaluation below.

3.2.2 Model Evaluation and Prediction

pred.lm <- predict(linear_model, newdata = test)
R.lm <- cor(pred.lm, test$bike_count)^2
R.lm
## [1] 0.6971688

Conclusion: The residual analysis graphs show a long left tail in the QQ plot. While the in-sample R-squared was 78%, it dropped to 70% on the out-of-sample test. A linear model can thus explain about 70% of the variation in the data, but it may not be the best model.
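The diagnostics referenced above follow the usual base-R pattern; the sketch below uses a small simulated fit rather than the report’s `linear_model`, purely to illustrate the plots we inspected:

```r
set.seed(1)
# Simulated counts with a square-root response, mimicking the model structure
x <- runif(500, 0, 30)
y <- rpois(500, lambda = exp(1 + 0.12 * x))
fit <- lm(sqrt(y) ~ x)

res <- residuals(fit)
qqnorm(res); qqline(res)            # tail behaviour shows as departures from the line
plot(fitted(fit), res,              # look for funnel shapes (heteroscedasticity)
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```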

4 Generalised Additive Model

library(mgcv)
library(forecast)
library(stats)
library(ggplot2)    # for the plots below
library(gridExtra)  # for grid.arrange()

4.1 Graphical analysis

Plotting the data.

bikes$Date <- as.Date(bikes$Date, format = "%Y-%m-%d")
p1 <- ggplot(data = bikes, mapping = aes(y = bike_count, x = Date)) +
      geom_point(color="gray25") + geom_smooth(color="brown3") + ylab("Rental Bike Count")
h2 <- ggplot(data = bikes, mapping = aes(x = bike_count)) + ggtitle("Histogram of Bikes Rentals") + geom_histogram(binwidth = 30)
grid.arrange(p1, h2, ncol = 2)

The distribution of bike rentals is skewed. However, we will not log-transform the response variable, because it contains zero values.
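A common alternative in this situation (not pursued in this report) is the log1p transform, log(1 + x), which is defined at zero and inverted by expm1:

```r
counts <- c(0, 5, 254, 930)   # example hourly rental counts, including a zero
log1p(counts)                 # finite for every value, unlike log(counts)
expm1(log1p(counts))          # inverse transform recovers the original counts
```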

4.1.1 Graphical Analysis - Categorical Variables & Continuous Variables

The box plots shown in the Graphical Analysis section above indicate that the predictors wday, day, and year do not seem very important. From the boxplot of the year variable, we can see that the data do not cover all of 2017 and 2018. The predictors season, month, hour, and holiday appear to be important.

Meanwhile, the scatter plots with smoothers in the Linear Model section show the response variable against each continuous predictor, indicating what kind of effect each predictor has on the response.

4.1.2 Panelling

Based on the boxplots, the predictors season, holiday, hour, and month are expected to be important. The effect of the continuous predictors may differ across these categorical variables. In the graph below, we inspect whether temperature interacts with season.

ggplot(data = bikes, mapping = aes(y = bike_count, x = temperature)) +
      geom_point(color="gray25") + geom_smooth(color="brown3") + facet_wrap(. ~season)

None of the relationships appears linear. To estimate their form, we fit a GAM and examine the estimated complexity of each smooth term.

4.2 Modeling

4.2.1 Fitting the starting model

The starting GAM contains all continuous predictors, each entered as a smooth term. Since we are modeling count data, we use the “quasipoisson” family.

gam.0 <- gam(bike_count ~ s(temperature) + s(humidity) + s(wind_speed) + s(visibility) + s(dew)
             + s(solar) + s(rainfall) + s(snowfall), data = train, family = quasipoisson())
summary(gam.0)
## 
## Family: quasipoisson 
## Link function: log 
## 
## Formula:
## bike_count ~ s(temperature) + s(humidity) + s(wind_speed) + s(visibility) + 
##     s(dew) + s(solar) + s(rainfall) + s(snowfall)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.291408   0.008926   704.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                  edf Ref.df      F p-value    
## s(temperature) 7.344  8.209 55.917  <2e-16 ***
## s(humidity)    8.864  8.990 33.166  <2e-16 ***
## s(wind_speed)  3.403  4.293 32.994  <2e-16 ***
## s(visibility)  1.932  2.403  4.317  0.0107 *  
## s(dew)         8.291  8.786 13.313  <2e-16 ***
## s(solar)       8.778  8.983 91.108  <2e-16 ***
## s(rainfall)    8.195  8.798 23.817  <2e-16 ***
## s(snowfall)    1.001  1.001  2.020  0.1552    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.612   Deviance explained = 64.5%
## GCV = 192.62  Scale est. = 203.89    n = 6785

Using a strict 5% threshold, there is no evidence that the smooth term of snowfall plays a role (p = 0.155). According to the edf values, none of the strongly significant smooth terms has a linear effect; most have very complex effects (edf above 7). We therefore remove s(snowfall).

gam.1 <- gam(bike_count ~ s(temperature) + s(humidity) + s(wind_speed) + s(visibility)
              + s(dew) + s(solar) + s(rainfall), data = train, family = quasipoisson())
summary(gam.1)
## 
## Family: quasipoisson 
## Link function: log 
## 
## Formula:
## bike_count ~ s(temperature) + s(humidity) + s(wind_speed) + s(visibility) + 
##     s(dew) + s(solar) + s(rainfall)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.291901   0.008919   705.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                  edf Ref.df      F p-value    
## s(temperature) 7.342  8.207 57.537  <2e-16 ***
## s(humidity)    8.860  8.990 34.060  <2e-16 ***
## s(wind_speed)  3.406  4.296 32.798  <2e-16 ***
## s(visibility)  1.932  2.403  4.296   0.011 *  
## s(dew)         8.329  8.805 13.463  <2e-16 ***
## s(solar)       8.780  8.983 90.763  <2e-16 ***
## s(rainfall)    8.191  8.796 23.734  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.612   Deviance explained = 64.5%
## GCV = 192.63  Scale est. = 204.26    n = 6785

All remaining smooth terms are significant (p-values < 5%). We compare the two models above with anova().

anova(gam.0, gam.1, test="F")$Pr  # p-value
## [1]        NA 0.1608531

The output shows no strong evidence that the first (more complex) model fits the data better. Thus, we proceed with the less complex model.

4.2.2 Model “Development”

According to the graphical analysis, the factors season, month, holiday, hour, and wday appear important, so we add them to the model. However, since season and month are closely related, we use only month, which gives a finer-grained picture.

gam.2 <- gam(bike_count ~ month + s(temperature) + s(humidity) + s(wind_speed) + s(visibility) 
             + s(dew) + s(solar) + s(rainfall), data = train, family = quasipoisson())
summary(gam.2)$r.sq
## [1] 0.64882
anova(gam.1, gam.2, test="F")$Pr
## [1]            NA 1.611917e-143

Adding the categorical variable hour to our model.

gam.3 <- gam(bike_count ~ hour + month + s(temperature) + s(humidity) + s(wind_speed) + 
            s(visibility) + s(dew) + s(solar) + s(rainfall), 
            data = train, family = quasipoisson())
capture.output(summary(gam.3))[54:61]
## [1] "s(visibility)  1.434  1.746   1.993 0.213747    "              
## [2] "s(dew)         7.027  7.962   6.755 9.21e-09 ***"              
## [3] "s(solar)       5.807  6.932   3.973 0.000242 ***"              
## [4] "s(rainfall)    8.757  8.978  86.217  < 2e-16 ***"              
## [5] "---"                                                           
## [6] "Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1"
## [7] ""                                                              
## [8] "R-sq.(adj) =  0.853   Deviance explained = 86.6%"
anova(gam.2, gam.3, test="F")$Pr
## [1] NA  0

With hour and month in the model, there is no longer evidence that the smooth term of visibility plays a role (p ≈ 0.21), so we drop it.

gam.4 <- gam(bike_count ~ hour + month + s(temperature) + s(humidity) + s(wind_speed) + s(dew)
             + s(solar) + s(rainfall), data = train, family = quasipoisson())
#summary(gam.4)
anova(gam.3, gam.4, test="F")$Pr
## [1]       NA 0.151628

The ANOVA comparison shows no strong evidence that gam.3 fits better, so we continue with the less complex model (gam.4) and add the categorical variable holiday.

gam.5 <- gam(bike_count ~ hour + holiday + month + s(temperature) + s(humidity) + s(wind_speed) 
             + s(dew) + s(solar) + s(rainfall), data = train, family = quasipoisson())
summary(gam.5)$r.sq
## [1] 0.8550654
anova(gam.4, gam.5, test="F")$Pr
## [1] NA NA

Adding the categorical variable wday.

gam.6 <- gam(bike_count ~ hour + holiday + month + wday + s(temperature) + s(humidity) + 
             s(wind_speed) + s(dew) + s(solar) + s(rainfall), 
             data = train, family = quasipoisson())
summary(gam.6)$r.sq
## [1] 0.8664235
anova(gam.5, gam.6, test="F")$Pr
## [1]           NA 1.874588e-66

All predictors of gam.5 and gam.6 are strongly significant (p-values < 5%). The higher adjusted R-squared and the ANOVA output give strong evidence that the model with more parameters (gam.6) fits the data better.

4.2.3 Interactions with smooth terms

In the graphs, temperature appeared to interact with the categorical variables. We inspect whether a separate smooth term for temperature is needed for each month.

gam.7 <- gam(bike_count ~ hour + holiday + month + wday + s(temperature, by = month) + 
               s(humidity)+ s(wind_speed) + s(dew) + s(solar) + s(rainfall), 
             data = train, family = quasipoisson())
capture.output(summary(gam.7))[78:80]
## [1] "R-sq.(adj) =   0.87   Deviance explained = 87.7%"
## [2] "GCV = 67.963  Scale est. = 65.366    n = 6785"   
## [3] NA
anova(gam.6, gam.7, test="F")$Pr
## [1]           NA 4.724427e-21

This last model has the highest adjusted R-squared (0.87), and the ANOVA output gives strong evidence that gam.7 fits the data better.

4.3 Model Prediction

Make predictions on the test data for the models with high R-squared values.

pred.gam5 <- predict.gam(gam.5, newdata = test)
pred.gam6 <- predict.gam(gam.6, newdata = test)
pred.gam <- predict.gam(gam.7, newdata = test)

Compute R-squared to evaluate out-of-sample performance.

cor(pred.gam5, test$bike_count)^2
## [1] 0.6716468
cor(pred.gam6, test$bike_count)^2
## [1] 0.6749364
cor(pred.gam, test$bike_count)^2
## [1] 0.6758239
R.gam <- cor(pred.gam, test$bike_count)^2

4.3.1 Visualisation of Residuals for the Final Model

Conclusion: The out-of-sample R-squared for the most complex model dropped substantially from the in-sample value (87% to 68%), and the same holds for the other models. However, the most complex model still performs best (highest R-squared), so we use gam.7 for the final predictions.

5 GLM with Poisson

5.1 Define Use Case and Choose a Model

library(car)
library("performance")
boxplot(bike_count ~ month, data = bikes, col=c('azure3', 'azure3','mistyrose', 
        'mistyrose','mistyrose','powderblue','powderblue', 'powderblue',
        'darkgoldenrod2','darkgoldenrod2','darkgoldenrod2', 'azure3'))
abline(h=500,lwd=2,lty=2, col="blue")

From the graph, we observe that on non-holidays between April and November, the number of rented bikes is typically above 500, which is close to the median (524).

This model predicts the maximum number of bikes needed. The response variable, the number of rented bikes, is a non-negative integer count rather than a continuous measure, so a Poisson model is applied. Since rentals from April to November on non-holidays are relatively high, we keep only the data meeting these conditions and remove the dates when the service did not operate. The filtering is applied to both the train and test data sets, producing train.p and test.p.

train.p <- subset(train, month %in% c(4,5,6,7,8,9,10,11))
train.p <- subset(train.p, holiday == "No Holiday")
train.p <- subset(train.p, select = c(-Date, -year, -day, -wday, -season,
                                      -holiday, -highdemand))
test.p <- subset(test, month %in% c(4,5,6,7,8,9,10,11))
test.p <- subset(test.p, holiday == "No Holiday")
test.p <- subset(test.p, select = c(-Date, -year, -day, -wday, -season,
                                    -holiday, -highdemand))

5.2 Build a model

5.2.1 A Full Poisson Model

glm.full0 <- glm(bike_count ~ month + hour + temperature + humidity + wind_speed + 
                       visibility + dew + solar + rainfall + snowfall, family = "poisson", 
                 data = train.p)
summary(glm.full0)
## 
## Call:
## glm(formula = bike_count ~ month + hour + temperature + humidity + 
##     wind_speed + visibility + dew + solar + rainfall + snowfall, 
##     family = "poisson", data = train.p)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -53.489   -6.047    0.427    5.934  101.484  
## 
## Coefficients:
##               Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)  7.187e+00  1.104e-02  650.730  < 2e-16 ***
## month5       2.103e-01  2.326e-03   90.432  < 2e-16 ***
## month6       3.315e-01  2.694e-03  123.031  < 2e-16 ***
## month7       2.647e-02  3.571e-03    7.413 1.23e-13 ***
## month8      -1.039e-01  3.741e-03  -27.763  < 2e-16 ***
## month9       2.158e-01  2.908e-03   74.195  < 2e-16 ***
## month10      2.615e-01  2.201e-03  118.802  < 2e-16 ***
## month11      8.230e-02  2.539e-03   32.417  < 2e-16 ***
## hour1       -2.600e-01  4.143e-03  -62.761  < 2e-16 ***
## hour2       -5.925e-01  4.639e-03 -127.722  < 2e-16 ***
## hour3       -9.289e-01  5.327e-03 -174.365  < 2e-16 ***
## hour4       -1.350e+00  6.169e-03 -218.877  < 2e-16 ***
## hour5       -1.291e+00  6.171e-03 -209.161  < 2e-16 ***
## hour6       -5.290e-01  4.618e-03 -114.548  < 2e-16 ***
## hour7        1.331e-01  3.843e-03   34.633  < 2e-16 ***
## hour8        6.088e-01  3.498e-03  174.058  < 2e-16 ***
## hour9        1.166e-01  3.954e-03   29.490  < 2e-16 ***
## hour10      -1.954e-01  4.308e-03  -45.363  < 2e-16 ***
## hour11      -1.610e-01  4.435e-03  -36.290  < 2e-16 ***
## hour12      -3.026e-02  4.477e-03   -6.759 1.39e-11 ***
## hour13      -1.259e-02  4.441e-03   -2.834   0.0046 ** 
## hour14      -4.752e-02  4.387e-03  -10.832  < 2e-16 ***
## hour15       9.442e-02  4.150e-03   22.751  < 2e-16 ***
## hour16       2.138e-01  3.888e-03   55.003  < 2e-16 ***
## hour17       4.935e-01  3.612e-03  136.637  < 2e-16 ***
## hour18       8.234e-01  3.352e-03  245.619  < 2e-16 ***
## hour19       6.727e-01  3.368e-03  199.752  < 2e-16 ***
## hour20       6.071e-01  3.382e-03  179.495  < 2e-16 ***
## hour21       6.001e-01  3.354e-03  178.909  < 2e-16 ***
## hour22       5.033e-01  3.437e-03  146.443  < 2e-16 ***
## hour23       2.238e-01  3.658e-03   61.164  < 2e-16 ***
## temperature -5.449e-03  4.030e-04  -13.520  < 2e-16 ***
## humidity    -1.378e-02  1.143e-04 -120.599  < 2e-16 ***
## wind_speed  -1.413e-02  6.101e-04  -23.155  < 2e-16 ***
## visibility   3.727e-05  1.316e-06   28.320  < 2e-16 ***
## dew          2.725e-02  4.062e-04   67.078  < 2e-16 ***
## solar        2.612e-02  1.277e-03   20.453  < 2e-16 ***
## rainfall    -4.973e-01  2.387e-03 -208.318  < 2e-16 ***
## snowfall    -1.409e-01  3.153e-03  -44.673  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2092688  on 4309  degrees of freedom
## Residual deviance:  530005  on 4271  degrees of freedom
## AIC: 565902
## 
## Number of Fisher Scoring iterations: 7

We build the first model (glm.full0) with all variables; all of them are significant.

vif(glm.full0)
##                  GVIF Df GVIF^(1/(2*Df))
## month       11.115811  7        1.187709
## hour         8.676991 23        1.048092
## temperature 43.321507  1        6.581908
## humidity    17.568770  1        4.191512
## wind_speed   1.428760  1        1.195308
## visibility   1.878927  1        1.370739
## dew         52.604668  1        7.252908
## solar        6.269015  1        2.503800
## rainfall     1.118601  1        1.057639
## snowfall     1.045478  1        1.022486

After measuring collinearity, the variables “month”, “temperature”, “humidity”, and “dew” show serious collinearity problems. The variables “hour” and “solar” have collinearity problems as well (GVIFs above 5).

Remove ‘humidity’: the humidity in the data is relative humidity, the percentage of saturation at a given temperature; it depends on both moisture content and temperature and can be calculated from temperature and dew point temperature. The dew point is the temperature at which the air is saturated (100 percent relative humidity) and depends only on the amount of moisture in the air. To avoid collinearity, ‘humidity’ is removed.
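This dependence can be made explicit with the Magnus approximation; the function below is an illustrative sketch (the constants are the common Alduchov–Eskridge values), not part of the report’s pipeline:

```r
# Relative humidity (%) from air temperature and dew point, both in deg C,
# via the Magnus approximation with a = 17.625, b = 243.04.
rel_humidity <- function(temp_c, dew_c, a = 17.625, b = 243.04) {
  100 * exp(a * dew_c / (b + dew_c)) / exp(a * temp_c / (b + temp_c))
}
rel_humidity(-5.2, -17.6)   # first row of the data: about 37, matching the recorded humidity
rel_humidity(10, 10)        # air at its dew point is saturated, i.e. 100%
```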

glm.full1 <- glm(bike_count ~  month + hour + temperature + wind_speed + 
                            visibility + dew + solar + rainfall + snowfall, 
                 family = "poisson", data = train.p)
summary(glm.full1)
## 
## Call:
## glm(formula = bike_count ~ month + hour + temperature + wind_speed + 
##     visibility + dew + solar + rainfall + snowfall, family = "poisson", 
##     data = train.p)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -54.707   -6.115    0.562    5.997  107.767  
## 
## Coefficients:
##               Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)  5.960e+00  4.306e-03 1384.029  < 2e-16 ***
## month5       2.299e-01  2.321e-03   99.093  < 2e-16 ***
## month6       3.455e-01  2.696e-03  128.176  < 2e-16 ***
## month7       3.701e-02  3.572e-03   10.362  < 2e-16 ***
## month8      -9.600e-02  3.739e-03  -25.674  < 2e-16 ***
## month9       2.266e-01  2.910e-03   77.880  < 2e-16 ***
## month10      2.816e-01  2.199e-03  128.060  < 2e-16 ***
## month11      1.133e-01  2.525e-03   44.892  < 2e-16 ***
## hour1       -2.597e-01  4.143e-03  -62.677  < 2e-16 ***
## hour2       -5.911e-01  4.639e-03 -127.417  < 2e-16 ***
## hour3       -9.305e-01  5.328e-03 -174.667  < 2e-16 ***
## hour4       -1.350e+00  6.169e-03 -218.889  < 2e-16 ***
## hour5       -1.294e+00  6.172e-03 -209.709  < 2e-16 ***
## hour6       -5.294e-01  4.619e-03 -114.621  < 2e-16 ***
## hour7        1.451e-01  3.839e-03   37.799  < 2e-16 ***
## hour8        6.159e-01  3.496e-03  176.179  < 2e-16 ***
## hour9        1.376e-01  3.946e-03   34.880  < 2e-16 ***
## hour10      -1.797e-01  4.300e-03  -41.795  < 2e-16 ***
## hour11      -1.462e-01  4.424e-03  -33.045  < 2e-16 ***
## hour12      -2.407e-02  4.471e-03   -5.383 7.33e-08 ***
## hour13      -1.070e-02  4.436e-03   -2.411   0.0159 *  
## hour14      -5.369e-02  4.383e-03  -12.250  < 2e-16 ***
## hour15       8.842e-02  4.147e-03   21.321  < 2e-16 ***
## hour16       2.131e-01  3.882e-03   54.882  < 2e-16 ***
## hour17       4.973e-01  3.609e-03  137.807  < 2e-16 ***
## hour18       8.353e-01  3.349e-03  249.442  < 2e-16 ***
## hour19       6.918e-01  3.362e-03  205.772  < 2e-16 ***
## hour20       6.255e-01  3.378e-03  185.157  < 2e-16 ***
## hour21       6.167e-01  3.351e-03  184.001  < 2e-16 ***
## hour22       5.164e-01  3.435e-03  150.341  < 2e-16 ***
## hour23       2.321e-01  3.658e-03   63.447  < 2e-16 ***
## temperature  3.651e-02  1.967e-04  185.625  < 2e-16 ***
## wind_speed  -1.238e-02  6.089e-04  -20.330  < 2e-16 ***
## visibility   6.444e-05  1.292e-06   49.873  < 2e-16 ***
## dew         -1.751e-02  1.554e-04 -112.709  < 2e-16 ***
## solar        2.950e-02  1.273e-03   23.170  < 2e-16 ***
## rainfall    -5.537e-01  2.427e-03 -228.137  < 2e-16 ***
## snowfall    -1.680e-01  3.159e-03  -53.176  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2092688  on 4309  degrees of freedom
## Residual deviance:  543634  on 4272  degrees of freedom
## AIC: 579529
## 
## Number of Fisher Scoring iterations: 7

All variables are significant.

vif(glm.full1)
##                  GVIF Df GVIF^(1/(2*Df))
## month       10.925637  7        1.186246
## hour         8.452195 23        1.047494
## temperature 10.281605  1        3.206494
## wind_speed   1.427537  1        1.194796
## visibility   1.826441  1        1.351459
## dew          7.752833  1        2.784391
## solar        6.261017  1        2.502202
## rainfall     1.080463  1        1.039453
## snowfall     1.040608  1        1.020102

The severity of collinearity decreased; however, the GVIFs of "month" and "temperature" are still above 10. Temperature naturally varies with month, so correlation between the two is expected. On the other hand, "month" also captures seasonal effects such as summer vacation, which may influence the number of bike rentals. As a result, both variables are kept.

drop1(glm.full1, test = "F")
## Single term deletions
## 
## Model:
## bike_count ~ month + hour + temperature + wind_speed + visibility + 
##     dew + solar + rainfall + snowfall
##             Df Deviance     AIC   F value    Pr(>F)    
## <none>           543634  579529                        
## month        7   624543  660424   90.8287 < 2.2e-16 ***
## hour        23  1259771 1295619  244.6768 < 2.2e-16 ***
## temperature  1   578264  614157  272.1298 < 2.2e-16 ***
## wind_speed   1   544048  579941    3.2569   0.07119 .  
## visibility   1   546127  582019   19.5880 9.847e-06 ***
## dew          1   556263  592156   99.2445 < 2.2e-16 ***
## solar        1   544172  580064    4.2251   0.03989 *  
## rainfall     1   679300  715192 1066.0917 < 2.2e-16 ***
## snowfall     1   547260  583153   28.4985 9.864e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
glm_w <- glm(bike_count ~ month + hour + temperature +  
                       visibility + dew + solar + rainfall + snowfall, 
             family = "poisson", data = train.p)
capture.output(summary(glm_w))[56]
## [1] "AIC: 579941"

When "wind_speed" is excluded, the AIC does not decrease (579,941 vs. 579,529), so "wind_speed" is kept. The graphical analysis showed that the temperature variable does not follow a linear trend; a quadratic or cubic effect seems more plausible.

glm.full2 <- glm(bike_count ~ month + hour + poly(temperature, degree=2) + wind_speed + 
                 visibility + dew + solar + rainfall + snowfall, 
                 family = "poisson", data = train.p)
summary(glm.full2)
## 
## Call:
## glm(formula = bike_count ~ month + hour + poly(temperature, degree = 2) + 
##     wind_speed + visibility + dew + solar + rainfall + snowfall, 
##     family = "poisson", data = train.p)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -51.852   -5.413    0.356    5.532  108.786  
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     6.593e+00  4.204e-03 1568.28   <2e-16 ***
## month5                          2.010e-01  2.311e-03   86.95   <2e-16 ***
## month6                          3.608e-01  2.677e-03  134.78   <2e-16 ***
## month7                          1.990e-01  3.617e-03   55.02   <2e-16 ***
## month8                          9.300e-02  3.804e-03   24.45   <2e-16 ***
## month9                          2.321e-01  2.889e-03   80.34   <2e-16 ***
## month10                         3.075e-01  2.209e-03  139.22   <2e-16 ***
## month11                         2.649e-01  2.643e-03  100.23   <2e-16 ***
## hour1                          -2.504e-01  4.143e-03  -60.43   <2e-16 ***
## hour2                          -5.893e-01  4.639e-03 -127.05   <2e-16 ***
## hour3                          -9.207e-01  5.327e-03 -172.82   <2e-16 ***
## hour4                          -1.340e+00  6.169e-03 -217.28   <2e-16 ***
## hour5                          -1.281e+00  6.171e-03 -207.66   <2e-16 ***
## hour6                          -5.148e-01  4.618e-03 -111.47   <2e-16 ***
## hour7                           1.426e-01  3.838e-03   37.16   <2e-16 ***
## hour8                           5.981e-01  3.501e-03  170.85   <2e-16 ***
## hour9                           9.245e-02  3.961e-03   23.34   <2e-16 ***
## hour10                         -2.509e-01  4.329e-03  -57.97   <2e-16 ***
## hour11                         -2.298e-01  4.464e-03  -51.48   <2e-16 ***
## hour12                         -1.121e-01  4.515e-03  -24.82   <2e-16 ***
## hour13                         -9.823e-02  4.481e-03  -21.92   <2e-16 ***
## hour14                         -1.135e-01  4.411e-03  -25.74   <2e-16 ***
## hour15                          3.707e-02  4.166e-03    8.90   <2e-16 ***
## hour16                          1.800e-01  3.887e-03   46.31   <2e-16 ***
## hour17                          4.869e-01  3.607e-03  135.00   <2e-16 ***
## hour18                          8.398e-01  3.345e-03  251.11   <2e-16 ***
## hour19                          6.970e-01  3.359e-03  207.50   <2e-16 ***
## hour20                          6.320e-01  3.377e-03  187.16   <2e-16 ***
## hour21                          6.199e-01  3.351e-03  184.98   <2e-16 ***
## hour22                          5.165e-01  3.435e-03  150.36   <2e-16 ***
## hour23                          2.309e-01  3.658e-03   63.13   <2e-16 ***
## poly(temperature, degree = 2)1  2.168e+01  1.103e-01  196.62   <2e-16 ***
## poly(temperature, degree = 2)2 -8.864e+00  4.514e-02 -196.36   <2e-16 ***
## wind_speed                     -6.804e-03  6.088e-04  -11.18   <2e-16 ***
## visibility                      6.233e-05  1.289e-06   48.34   <2e-16 ***
## dew                            -1.994e-02  1.547e-04 -128.86   <2e-16 ***
## solar                           7.580e-02  1.299e-03   58.34   <2e-16 ***
## rainfall                       -5.637e-01  2.446e-03 -230.48   <2e-16 ***
## snowfall                       -9.723e-02  3.130e-03  -31.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2092688  on 4309  degrees of freedom
## Residual deviance:  503687  on 4271  degrees of freedom
## AIC: 539584
## 
## Number of Fisher Scoring iterations: 7
glm.full3 <- glm(bike_count ~ month + hour + poly(temperature, degree=3) +  wind_speed + 
                visibility + dew + solar + rainfall + snowfall, 
                family = "poisson", data = train.p)
summary(glm.full3)
## 
## Call:
## glm(formula = bike_count ~ month + hour + poly(temperature, degree = 3) + 
##     wind_speed + visibility + dew + solar + rainfall + snowfall, 
##     family = "poisson", data = train.p)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -53.206   -5.242    0.370    5.350  108.371  
## 
## Coefficients:
##                                  Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)                     6.653e+00  4.224e-03 1575.027  < 2e-16 ***
## month5                          1.565e-01  2.340e-03   66.860  < 2e-16 ***
## month6                          2.482e-01  2.789e-03   88.990  < 2e-16 ***
## month7                          8.645e-02  3.697e-03   23.382  < 2e-16 ***
## month8                         -1.215e-02  3.863e-03   -3.145  0.00166 ** 
## month9                          1.281e-01  2.985e-03   42.919  < 2e-16 ***
## month10                         3.331e-01  2.221e-03  149.998  < 2e-16 ***
## month11                         2.542e-01  2.667e-03   95.315  < 2e-16 ***
## hour1                          -2.518e-01  4.143e-03  -60.775  < 2e-16 ***
## hour2                          -5.866e-01  4.639e-03 -126.449  < 2e-16 ***
## hour3                          -9.237e-01  5.327e-03 -173.384  < 2e-16 ***
## hour4                          -1.344e+00  6.169e-03 -217.800  < 2e-16 ***
## hour5                          -1.287e+00  6.172e-03 -208.483  < 2e-16 ***
## hour6                          -5.168e-01  4.618e-03 -111.901  < 2e-16 ***
## hour7                           1.426e-01  3.838e-03   37.152  < 2e-16 ***
## hour8                           5.905e-01  3.500e-03  168.713  < 2e-16 ***
## hour9                           8.845e-02  3.958e-03   22.345  < 2e-16 ***
## hour10                         -2.505e-01  4.327e-03  -57.887  < 2e-16 ***
## hour11                         -2.286e-01  4.462e-03  -51.230  < 2e-16 ***
## hour12                         -1.181e-01  4.514e-03  -26.162  < 2e-16 ***
## hour13                         -1.008e-01  4.482e-03  -22.487  < 2e-16 ***
## hour14                         -1.151e-01  4.413e-03  -26.079  < 2e-16 ***
## hour15                          3.297e-02  4.167e-03    7.912 2.53e-15 ***
## hour16                          1.770e-01  3.886e-03   45.564  < 2e-16 ***
## hour17                          4.815e-01  3.605e-03  133.568  < 2e-16 ***
## hour18                          8.350e-01  3.343e-03  249.737  < 2e-16 ***
## hour19                          6.942e-01  3.358e-03  206.745  < 2e-16 ***
## hour20                          6.248e-01  3.376e-03  185.046  < 2e-16 ***
## hour21                          6.148e-01  3.351e-03  183.475  < 2e-16 ***
## hour22                          5.117e-01  3.435e-03  148.976  < 2e-16 ***
## hour23                          2.298e-01  3.658e-03   62.816  < 2e-16 ***
## poly(temperature, degree = 3)1  2.335e+01  1.103e-01  211.659  < 2e-16 ***
## poly(temperature, degree = 3)2 -7.265e+00  4.567e-02 -159.057  < 2e-16 ***
## poly(temperature, degree = 3)3 -5.691e+00  4.033e-02 -141.129  < 2e-16 ***
## wind_speed                     -1.037e-02  6.087e-04  -17.034  < 2e-16 ***
## visibility                      6.394e-05  1.290e-06   49.548  < 2e-16 ***
## dew                            -1.920e-02  1.548e-04 -124.062  < 2e-16 ***
## solar                           7.049e-02  1.298e-03   54.313  < 2e-16 ***
## rainfall                       -5.584e-01  2.437e-03 -229.179  < 2e-16 ***
## snowfall                       -1.494e-01  3.175e-03  -47.057  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2092688  on 4309  degrees of freedom
## Residual deviance:  484023  on 4270  degrees of freedom
## AIC: 519922
## 
## Number of Fisher Scoring iterations: 7

After adding the cubic term, the model "glm.full3" shows a lower AIC (519,922) than the quadratic model "glm.full2" (539,584).

A Poisson GLM assumes that the mean and variance of the response variable are equal. If the residual deviance of the fitted model is much larger than the residual degrees of freedom, the model is overdispersed, which indicates that a Poisson distribution does not adequately model the variance. The dispersion ratio of model "glm.full3" is about 113, far above the expected value of 1, so the model is clearly overdispersed.

#Goodness of fit = Residual Deviance / degrees of freedom
ods <- glm.full3$deviance / glm.full3$df.residual
ods
## [1] 113.3544
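An alternative dispersion check uses Pearson residuals instead of the deviance; a minimal sketch, assuming `glm.full3` has been fitted as above:

```r
# Pearson-based dispersion estimate: sum of squared Pearson residuals divided
# by the residual degrees of freedom. Values far above 1 indicate
# overdispersion, matching the deviance-based ratio computed above.
pearson_ods <- sum(residuals(glm.full3, type = "pearson")^2) / glm.full3$df.residual
pearson_ods
```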

5.2.2 Build the Quasipoisson Model

glm.full3q <- glm(bike_count ~ month + hour + poly(temperature, degree=3) + wind_speed + 
                  visibility + dew + solar + rainfall + snowfall, 
                  family = "quasipoisson", data = train.p)
#summary(glm.full3q)
capture.output(summary(glm.full3q))[12:52]
##  [1] "                                 Estimate Std. Error t value Pr(>|t|)"
##  [2] "(Intercept)                     6.653e+00  6.506e+00   1.023    0.307"
##  [3] "month5                          1.565e-01  3.604e+00   0.043    0.965"
##  [4] "month6                          2.482e-01  4.296e+00   0.058    0.954"
##  [5] "month7                          8.645e-02  5.695e+00   0.015    0.988"
##  [6] "month8                         -1.215e-02  5.950e+00  -0.002    0.998"
##  [7] "month9                          1.281e-01  4.598e+00   0.028    0.978"
##  [8] "month10                         3.331e-01  3.421e+00   0.097    0.922"
##  [9] "month11                         2.542e-01  4.108e+00   0.062    0.951"
## [10] "hour1                          -2.518e-01  6.381e+00  -0.039    0.969"
## [11] "hour2                          -5.866e-01  7.145e+00  -0.082    0.935"
## [12] "hour3                          -9.237e-01  8.205e+00  -0.113    0.910"
## [13] "hour4                          -1.344e+00  9.502e+00  -0.141    0.888"
## [14] "hour5                          -1.287e+00  9.506e+00  -0.135    0.892"
## [15] "hour6                          -5.168e-01  7.113e+00  -0.073    0.942"
## [16] "hour7                           1.426e-01  5.911e+00   0.024    0.981"
## [17] "hour8                           5.905e-01  5.391e+00   0.110    0.913"
## [18] "hour9                           8.845e-02  6.097e+00   0.015    0.988"
## [19] "hour10                         -2.505e-01  6.664e+00  -0.038    0.970"
## [20] "hour11                         -2.286e-01  6.873e+00  -0.033    0.973"
## [21] "hour12                         -1.181e-01  6.953e+00  -0.017    0.986"
## [22] "hour13                         -1.008e-01  6.903e+00  -0.015    0.988"
## [23] "hour14                         -1.151e-01  6.797e+00  -0.017    0.986"
## [24] "hour15                          3.297e-02  6.417e+00   0.005    0.996"
## [25] "hour16                          1.770e-01  5.985e+00   0.030    0.976"
## [26] "hour17                          4.815e-01  5.553e+00   0.087    0.931"
## [27] "hour18                          8.350e-01  5.149e+00   0.162    0.871"
## [28] "hour19                          6.942e-01  5.172e+00   0.134    0.893"
## [29] "hour20                          6.248e-01  5.200e+00   0.120    0.904"
## [30] "hour21                          6.148e-01  5.161e+00   0.119    0.905"
## [31] "hour22                          5.117e-01  5.290e+00   0.097    0.923"
## [32] "hour23                          2.298e-01  5.634e+00   0.041    0.967"
## [33] "poly(temperature, degree = 3)1  2.335e+01  1.699e+02   0.137    0.891"
## [34] "poly(temperature, degree = 3)2 -7.265e+00  7.035e+01  -0.103    0.918"
## [35] "poly(temperature, degree = 3)3 -5.691e+00  6.211e+01  -0.092    0.927"
## [36] "wind_speed                     -1.037e-02  9.376e-01  -0.011    0.991"
## [37] "visibility                      6.394e-05  1.987e-03   0.032    0.974"
## [38] "dew                            -1.920e-02  2.384e-01  -0.081    0.936"
## [39] "solar                           7.049e-02  1.999e+00   0.035    0.972"
## [40] "rainfall                       -5.584e-01  3.753e+00  -0.149    0.882"
## [41] "snowfall                       -1.494e-01  4.890e+00  -0.031    0.976"
capture.output(summary(glm.full1))[58]
## [1] "AIC: 579529"
drop1(glm.full3q,test="F")
## Single term deletions
## 
## Model:
## bike_count ~ month + hour + poly(temperature, degree = 3) + wind_speed + 
##     visibility + dew + solar + rainfall + snowfall
##                               Df Deviance   F value    Pr(>F)    
## <none>                             484023                        
## month                          7   524092   50.4970 < 2.2e-16 ***
## hour                          23  1207089  277.3392 < 2.2e-16 ***
## poly(temperature, degree = 3)  3   578264  277.1263 < 2.2e-16 ***
## wind_speed                     1   484314    2.5656    0.1093    
## visibility                     1   486484   21.7062 3.274e-06 ***
## dew                            1   499307  134.8271 < 2.2e-16 ***
## solar                          1   486985   26.1297 3.332e-07 ***
## rainfall                       1   623286 1228.5614 < 2.2e-16 ***
## snowfall                       1   486757   24.1177 9.401e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

To account for overdispersion, the "quasipoisson" family is used to build a new model. Most p-values become much larger (p > 0.05), suggesting that no variable is significant. However, when the quasipoisson model is assessed with an F test (drop1), only the p-value of "wind_speed" exceeds 0.05. Since the two results conflict, a negative binomial model is fitted next.
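The inflated quasipoisson standard errors follow directly from the estimated dispersion: the coefficient estimates are identical to the Poisson fit, and each Poisson standard error is multiplied by the square root of the dispersion parameter. A minimal sketch, assuming `glm.full3` and `glm.full3q` as fitted above:

```r
# Quasipoisson SEs equal Poisson SEs scaled by sqrt(dispersion); with a
# dispersion near 113, every SE is inflated by a factor of roughly 10.6.
phi      <- summary(glm.full3q)$dispersion
se_pois  <- summary(glm.full3)$coefficients[, "Std. Error"]
se_quasi <- summary(glm.full3q)$coefficients[, "Std. Error"]
all.equal(unname(se_quasi), unname(se_pois * sqrt(phi)))
```

This explains why the Wald z/t statistics shrink toward zero under quasipoisson even though the fitted means are unchanged.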

5.2.3 Build the Negative Binomial Model

library(glmmTMB)
glm.full3nb0 <- glmmTMB(bike_count ~  month + hour + poly(temperature, degree=3) 
                        + wind_speed + visibility + dew + solar + rainfall+ snowfall, 
                        data = train.p ,family="nbinom2")
summary(glm.full3nb0)
##  Family: nbinom2  ( log )
## Formula:          
## bike_count ~ month + hour + poly(temperature, degree = 3) + wind_speed +  
##     visibility + dew + solar + rainfall + snowfall
## Data: train.p
## 
##      AIC      BIC   logLik deviance df.resid 
##       NA       NA       NA       NA     4269 
## 
## 
## Overdispersion parameter for nbinom2 family (): 3.71 
## 
## Conditional model:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     6.6398461  0.0539453  123.08  < 2e-16 ***
## month5                          0.1550457  0.0344568    4.50 6.80e-06 ***
## month6                          0.2845797  0.0427726    6.65 2.87e-11 ***
## month7                          0.0103328  0.0525515    0.20 0.844123    
## month8                         -0.1210934  0.0539623   -2.24 0.024830 *  
## month9                          0.0802032  0.0410570    1.95 0.050765 .  
## month10                         0.3858171  0.0320035   12.06  < 2e-16 ***
## month11                         0.4146592  0.0369253   11.23  < 2e-16 ***
## hour1                          -0.2364749  0.0546135   -4.33 1.49e-05 ***
## hour2                          -0.5672537  0.0548426  -10.34  < 2e-16 ***
## hour3                          -0.8963052  0.0555463  -16.14  < 2e-16 ***
## hour4                          -1.2941545  0.0549593  -23.55  < 2e-16 ***
## hour5                          -1.2377618  0.0558382  -22.17  < 2e-16 ***
## hour6                          -0.4688265  0.0552406   -8.49  < 2e-16 ***
## hour7                           0.2105909  0.0555062    3.79 0.000148 ***
## hour8                           0.6161298  0.0558433   11.03  < 2e-16 ***
## hour9                           0.0621336  0.0580158    1.07 0.284180    
## hour10                         -0.3323503  0.0608565   -5.46 4.73e-08 ***
## hour11                         -0.3678135  0.0645079   -5.70 1.19e-08 ***
## hour12                         -0.2946353  0.0666299   -4.42 9.78e-06 ***
## hour13                         -0.2307972  0.0669570   -3.45 0.000567 ***
## hour14                         -0.3205165  0.0656518   -4.88 1.05e-06 ***
## hour15                         -0.1668450  0.0633620   -2.63 0.008458 ** 
## hour16                          0.0102945  0.0608231    0.17 0.865597    
## hour17                          0.3101779  0.0588425    5.27 1.35e-07 ***
## hour18                          0.7510536  0.0571758   13.14  < 2e-16 ***
## hour19                          0.5957831  0.0562003   10.60  < 2e-16 ***
## hour20                          0.5553654  0.0554031   10.02  < 2e-16 ***
## hour21                          0.5770352  0.0546088   10.57  < 2e-16 ***
## hour22                          0.4786738  0.0548400    8.73  < 2e-16 ***
## hour23                          0.2052227  0.0545917    3.76 0.000170 ***
## poly(temperature, degree = 3)1 33.9182267  1.8507807   18.33  < 2e-16 ***
## poly(temperature, degree = 3)2 -7.5115579  0.6748325  -11.13  < 2e-16 ***
## poly(temperature, degree = 3)3 -5.6975371  0.5927848   -9.61  < 2e-16 ***
## wind_speed                     -0.0162001  0.0101921   -1.59 0.111950    
## visibility                      0.0001531         NA      NA       NA    
## dew                            -0.0304058  0.0021653  -14.04  < 2e-16 ***
## solar                           0.1099859  0.0209383    5.25 1.50e-07 ***
## rainfall                       -0.1157364  0.0040745  -28.41  < 2e-16 ***
## snowfall                       -0.0980377  0.0255153   -3.84 0.000122 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
glm.full3nb <- glmmTMB(bike_count ~  month + hour + poly(temperature, degree=3) 
                       + dew + solar + rainfall+ snowfall, data = train.p, family="nbinom2")
summary(glm.full3nb)
##  Family: nbinom2  ( log )
## Formula:          
## bike_count ~ month + hour + poly(temperature, degree = 3) + dew +  
##     solar + rainfall + snowfall
## Data: train.p
## 
##      AIC      BIC   logLik deviance df.resid 
##  63028.9  63277.3 -31475.5  62950.9     4271 
## 
## 
## Overdispersion parameter for nbinom2 family (): 3.66 
## 
## Conditional model:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     6.886552   0.054655  126.00  < 2e-16 ***
## month5                          0.215965   0.034991    6.17 6.75e-10 ***
## month6                          0.362799   0.043449    8.35  < 2e-16 ***
## month7                          0.184775   0.054551    3.39 0.000706 ***
## month8                          0.073191   0.056507    1.30 0.195228    
## month9                          0.244065   0.043318    5.63 1.76e-08 ***
## month10                         0.453604   0.032264   14.06  < 2e-16 ***
## month11                         0.381302   0.036399   10.48  < 2e-16 ***
## hour1                          -0.237790   0.054932   -4.33 1.50e-05 ***
## hour2                          -0.576276   0.055159  -10.45  < 2e-16 ***
## hour3                          -0.906657   0.055794  -16.25  < 2e-16 ***
## hour4                          -1.304172   0.055261  -23.60  < 2e-16 ***
## hour5                          -1.255152   0.056071  -22.39  < 2e-16 ***
## hour6                          -0.476170   0.055407   -8.59  < 2e-16 ***
## hour7                           0.193851   0.055768    3.48 0.000509 ***
## hour8                           0.598098   0.056188   10.64  < 2e-16 ***
## hour9                           0.039317   0.058416    0.67 0.500911    
## hour10                         -0.345654   0.061237   -5.64 1.66e-08 ***
## hour11                         -0.371640   0.064917   -5.72 1.04e-08 ***
## hour12                         -0.299812   0.066994   -4.48 7.63e-06 ***
## hour13                         -0.238073   0.067189   -3.54 0.000395 ***
## hour14                         -0.335038   0.065722   -5.10 3.44e-07 ***
## hour15                         -0.179450   0.063316   -2.83 0.004594 ** 
## hour16                         -0.002918   0.060565   -0.05 0.961576    
## hour17                          0.303001   0.058442    5.18 2.16e-07 ***
## hour18                          0.745131   0.056939   13.09  < 2e-16 ***
## hour19                          0.594710   0.056082   10.60  < 2e-16 ***
## hour20                          0.554264   0.055415   10.00  < 2e-16 ***
## hour21                          0.578150   0.054889   10.53  < 2e-16 ***
## hour22                          0.483622   0.055211    8.76  < 2e-16 ***
## hour23                          0.203223   0.054968    3.70 0.000218 ***
## poly(temperature, degree = 3)1 36.299012   1.873175   19.38  < 2e-16 ***
## poly(temperature, degree = 3)2 -7.521284   0.680379  -11.05  < 2e-16 ***
## poly(temperature, degree = 3)3 -5.676800   0.596561   -9.52  < 2e-16 ***
## dew                            -0.041416   0.002357  -17.57  < 2e-16 ***
## solar                           0.097043   0.020944    4.63 3.60e-06 ***
## rainfall                       -0.116998   0.004094  -28.58  < 2e-16 ***
## snowfall                       -0.104882   0.025385   -4.13 3.60e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
detach("package:glmmTMB", unload=TRUE)

According to the negative binomial results, "wind_speed" is not significant, and the standard error of "visibility" could not be estimated (NA). Both are therefore removed, and a new negative binomial model (glm.full3nb) is built.
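As a cross-check, the same negative binomial model could also be fitted with `glm.nb` from the MASS package, which uses the same log-link nbinom2 parameterization and reports a conventional AIC; a sketch, assuming `train.p` as above (the object name `glm.full3nb_mass` is illustrative):

```r
library(MASS)  # provides glm.nb for negative binomial GLMs

glm.full3nb_mass <- glm.nb(bike_count ~ month + hour + poly(temperature, degree = 3)
                           + dew + solar + rainfall + snowfall, data = train.p)
AIC(glm.full3nb_mass)  # should be close to the glmmTMB AIC of 63028.9
```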

5.3 Evaluating and Comparing Model Performance

Comparing the three models, the Negative Binomial model has the smallest AIC, although its RMSE is slightly higher than the others'.

#Poisson
model_performance(glm.full3)
#Quasipoisson
model_performance(glm.full3q)
#Negative Binomial
model_performance(glm.full3nb)

5.3.1 Predicted vs. Residuals

From the predicted-vs-residuals plots, the Poisson and Quasipoisson models are almost identical: most residuals lie between -50 and 50, which seems reasonable relative to the predicted values. By contrast, the residuals of the Negative Binomial model are very large compared to its predicted values.

par(mfrow=c(1,2))
#Poisson
plot(predict(glm.full3,type="response"),residuals(glm.full3), main="Poisson",
ylab="Residuals", xlab="Predicted", col="skyblue")
abline(h=0,lty=1,col="gray")
lines(lowess(predict(glm.full3,type="response"),residuals(glm.full3)),lwd=2,lty=2,col="blue")
#Quasipoisson
plot(predict(glm.full3q,type="response"),residuals(glm.full3q), main="Quasipoisson",
ylab="Residuals", xlab="Predicted", col="skyblue")
abline(h=0,lty=1,col="gray")
lines(lowess(predict(glm.full3q,type="response"),residuals(glm.full3q)),lwd=2,lty=2,col="blue")

par(mfrow=c(1,1))
#Negative Binomial
plot(predict(glm.full3nb,type="response"),residuals(glm.full3nb), main="Negative Binomial",
ylab="Residuals", xlab="Predicted", col="skyblue")
abline(h=0,lty=1,col="gray")
lines(lowess(predict(glm.full3nb,type="response"),residuals(glm.full3nb)),lwd=2,lty=2,col="blue")

5.3.2 Model Prediction

pred.glm3 <- predict(glm.full3, type = "response", newdata = test.p)
pred.glmq <- predict(glm.full3q, type = "response", newdata = test.p)
pred.glmnb <- predict(glm.full3nb, type = "response", newdata = test.p)
#Poisson
R.glm3 <- cor(pred.glm3, test.p$bike_count)^2
#Quasipoisson
R.glmq <- cor(pred.glmq, test.p$bike_count)^2
#Negative Binomial
R.glmnb <- cor(pred.glmnb, test.p$bike_count)^2
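The RMSE referred to in the comparison above can be computed on the test set in the same way; a minimal sketch, assuming the predictions `pred.glm3`, `pred.glmq`, and `pred.glmnb` from the chunk above (the helper `rmse` is not part of the original code):

```r
# Root-mean-squared error of each model's predictions on the held-out test set.
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))
c(poisson = rmse(pred.glm3,  test.p$bike_count),
  quasi   = rmse(pred.glmq,  test.p$bike_count),
  negbin  = rmse(pred.glmnb, test.p$bike_count))
```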

Conclusion: theoretically, the Negative Binomial model should be the most reasonable of the three: unlike the Poisson model it does not suffer from overdispersion, and unlike the Quasipoisson model its variables are significant. However, its residuals are the largest of the three models, possibly due to correlation or collinearity among "month", "hour", and the weather-related variables.

6 GLM with Binomial

6.1 Model Fitting

ggplot(data = bikes, aes(x = highdemand, y = bike_count, fill = highdemand))+
    geom_boxplot(alpha=0.2) + labs(fill ="Demand > 500", y="Bike Count") + xlab("")

From the boxplots in the Graphic Analysis section, the high-demand indicator does not show a significant difference across weekdays. Therefore, the weekday variable is removed from the initial model.

glm_bike1 <- glm(highdemand ~ month + hour + temperature + humidity + wind_speed + visibility 
             + dew + solar + rainfall + snowfall + holiday, family = "binomial", data = train)
summary(glm_bike1)
## 
## Call:
## glm(formula = highdemand ~ month + hour + temperature + humidity + 
##     wind_speed + visibility + dew + solar + rainfall + snowfall + 
##     holiday, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.7805  -0.1922   0.0363   0.2574   8.4904  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -6.938e-01  1.103e+00  -0.629 0.529431    
## month2             2.771e-01  3.462e-01   0.800 0.423519    
## month3             2.449e+00  3.357e-01   7.295 2.99e-13 ***
## month4             2.660e+00  3.712e-01   7.166 7.70e-13 ***
## month5             3.018e+00  4.186e-01   7.211 5.57e-13 ***
## month6             4.044e+00  4.865e-01   8.312  < 2e-16 ***
## month7             2.358e+00  5.214e-01   4.523 6.11e-06 ***
## month8             1.880e+00  5.407e-01   3.478 0.000506 ***
## month9             3.101e+00  4.808e-01   6.451 1.11e-10 ***
## month10            4.348e+00  3.981e-01  10.922  < 2e-16 ***
## month11            4.573e+00  3.444e-01  13.278  < 2e-16 ***
## month12            1.486e+00  3.265e-01   4.550 5.36e-06 ***
## hour1             -9.038e-01  2.506e-01  -3.607 0.000310 ***
## hour2             -2.552e+00  2.632e-01  -9.695  < 2e-16 ***
## hour3             -4.817e+00  4.020e-01 -11.982  < 2e-16 ***
## hour4             -1.987e+01  2.989e+02  -0.066 0.946995    
## hour5             -1.974e+01  2.997e+02  -0.066 0.947477    
## hour6             -1.995e+00  2.612e-01  -7.639 2.19e-14 ***
## hour7             -3.403e-01  2.602e-01  -1.308 0.190981    
## hour8              2.758e+00  3.291e-01   8.381  < 2e-16 ***
## hour9              5.675e-01  3.270e-01   1.735 0.082670 .  
## hour10            -1.118e+00  3.300e-01  -3.388 0.000704 ***
## hour11            -1.565e+00  3.554e-01  -4.404 1.06e-05 ***
## hour12            -1.163e+00  3.926e-01  -2.961 0.003064 ** 
## hour13            -1.168e+00  3.898e-01  -2.996 0.002740 ** 
## hour14            -1.578e+00  3.749e-01  -4.209 2.56e-05 ***
## hour15            -7.640e-01  3.669e-01  -2.082 0.037303 *  
## hour16            -5.292e-01  3.483e-01  -1.519 0.128699    
## hour17             9.493e-01  3.410e-01   2.784 0.005377 ** 
## hour18             2.857e+00  3.389e-01   8.430  < 2e-16 ***
## hour19             1.268e+00  3.242e-01   3.911 9.20e-05 ***
## hour20             1.139e+00  3.069e-01   3.710 0.000207 ***
## hour21             1.085e+00  3.047e-01   3.561 0.000369 ***
## hour22             1.128e+00  2.933e-01   3.844 0.000121 ***
## hour23             6.659e-01  2.793e-01   2.384 0.017118 *  
## temperature        7.894e-02  4.486e-02   1.760 0.078437 .  
## humidity          -6.828e-02  1.169e-02  -5.839 5.26e-09 ***
## wind_speed        -1.453e-01  5.653e-02  -2.571 0.010138 *  
## visibility         8.618e-05  1.117e-04   0.772 0.440350    
## dew                1.152e-01  4.548e-02   2.534 0.011283 *  
## solar              1.052e+00  1.542e-01   6.825 8.78e-12 ***
## rainfall          -1.703e+00  1.956e-01  -8.707  < 2e-16 ***
## snowfall          -8.547e-01  2.326e-01  -3.675 0.000238 ***
## holidayNo Holiday  1.195e+00  2.457e-01   4.864 1.15e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9397.0  on 6784  degrees of freedom
## Residual deviance: 2865.5  on 6741  degrees of freedom
## AIC: 2953.5
## 
## Number of Fisher Scoring iterations: 17

Visibility is not significant in this model, so it will be removed in the next one.

glm_bike2 <- update(glm_bike1, . ~ . -visibility)
summary(glm_bike2)
## 
## Call:
## glm(formula = highdemand ~ month + hour + temperature + humidity + 
##     wind_speed + dew + solar + rainfall + snowfall + holiday, 
##     family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.7745  -0.1922   0.0363   0.2583   8.4904  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -0.52373    1.08287  -0.484 0.628631    
## month2              0.28685    0.34600   0.829 0.407079    
## month3              2.48574    0.33223   7.482 7.32e-14 ***
## month4              2.72475    0.36124   7.543 4.60e-14 ***
## month5              3.10162    0.40408   7.676 1.65e-14 ***
## month6              4.14295    0.46897   8.834  < 2e-16 ***
## month7              2.50865    0.48286   5.195 2.04e-07 ***
## month8              2.04771    0.49464   4.140 3.48e-05 ***
## month9              3.24488    0.44315   7.322 2.44e-13 ***
## month10             4.44995    0.37572  11.844  < 2e-16 ***
## month11             4.60928    0.34128  13.506  < 2e-16 ***
## month12             1.50369    0.32571   4.617 3.90e-06 ***
## hour1              -0.90589    0.25050  -3.616 0.000299 ***
## hour2              -2.55525    0.26311  -9.712  < 2e-16 ***
## hour3              -4.82368    0.40243 -11.986  < 2e-16 ***
## hour4             -19.89581  299.00803  -0.067 0.946948    
## hour5             -19.77708  299.89406  -0.066 0.947420    
## hour6              -2.00082    0.26107  -7.664 1.80e-14 ***
## hour7              -0.35047    0.25982  -1.349 0.177359    
## hour8               2.75119    0.32913   8.359  < 2e-16 ***
## hour9               0.55466    0.32662   1.698 0.089467 .  
## hour10             -1.12421    0.32980  -3.409 0.000652 ***
## hour11             -1.56726    0.35521  -4.412 1.02e-05 ***
## hour12             -1.15873    0.39234  -2.953 0.003143 ** 
## hour13             -1.16772    0.38932  -2.999 0.002706 ** 
## hour14             -1.56993    0.37476  -4.189 2.80e-05 ***
## hour15             -0.76049    0.36665  -2.074 0.038063 *  
## hour16             -0.53006    0.34830  -1.522 0.128040    
## hour17              0.94612    0.34109   2.774 0.005540 ** 
## hour18              2.85488    0.33885   8.425  < 2e-16 ***
## hour19              1.26808    0.32420   3.911 9.18e-05 ***
## hour20              1.13670    0.30676   3.706 0.000211 ***
## hour21              1.08006    0.30448   3.547 0.000389 ***
## hour22              1.12831    0.29322   3.848 0.000119 ***
## hour23              0.66314    0.27916   2.375 0.017525 *  
## temperature         0.07659    0.04485   1.708 0.087682 .  
## humidity           -0.06981    0.01155  -6.045 1.50e-09 ***
## wind_speed         -0.14263    0.05637  -2.530 0.011403 *  
## dew                 0.11438    0.04557   2.510 0.012064 *  
## solar               1.04284    0.15361   6.789 1.13e-11 ***
## rainfall           -1.71191    0.19541  -8.761  < 2e-16 ***
## snowfall           -0.86357    0.23320  -3.703 0.000213 ***
## holidayNo Holiday   1.19967    0.24586   4.879 1.06e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9397.0  on 6784  degrees of freedom
## Residual deviance: 2865.8  on 6742  degrees of freedom
## AIC: 2951.8
## 
## Number of Fisher Scoring iterations: 17

The AIC value decreases slightly with this change. Next, we remove the temperature variable, which does not appear significant in this model.

glm_bike3 <- update(glm_bike2, . ~ . -temperature)
summary(glm_bike3)
## 
## Call:
## glm(formula = highdemand ~ month + hour + humidity + wind_speed + 
##     dew + solar + rainfall + snowfall + holiday, family = "binomial", 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.7809  -0.1940   0.0372   0.2593   8.4904  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         1.073700   0.563336   1.906 0.056654 .  
## month2              0.352329   0.342329   1.029 0.303381    
## month3              2.515200   0.331156   7.595 3.07e-14 ***
## month4              2.799388   0.358021   7.819 5.32e-15 ***
## month5              3.198618   0.399836   8.000 1.25e-15 ***
## month6              4.263817   0.463821   9.193  < 2e-16 ***
## month7              2.647507   0.475746   5.565 2.62e-08 ***
## month8              2.185986   0.487735   4.482 7.40e-06 ***
## month9              3.357322   0.438122   7.663 1.82e-14 ***
## month10             4.502452   0.374430  12.025  < 2e-16 ***
## month11             4.658019   0.339528  13.719  < 2e-16 ***
## month12             1.491848   0.325124   4.589 4.46e-06 ***
## hour1              -0.906277   0.250302  -3.621 0.000294 ***
## hour2              -2.561136   0.263187  -9.731  < 2e-16 ***
## hour3              -4.876339   0.407576 -11.964  < 2e-16 ***
## hour4             -21.360835 486.595363  -0.044 0.964985    
## hour5             -21.335727 488.027075  -0.044 0.965129    
## hour6              -2.013862   0.261070  -7.714 1.22e-14 ***
## hour7              -0.364803   0.259759  -1.404 0.160203    
## hour8               2.720974   0.327573   8.306  < 2e-16 ***
## hour9               0.520355   0.326096   1.596 0.110554    
## hour10             -1.159124   0.329442  -3.518 0.000434 ***
## hour11             -1.595459   0.354734  -4.498 6.87e-06 ***
## hour12             -1.177347   0.392124  -3.002 0.002678 ** 
## hour13             -1.169983   0.390166  -2.999 0.002712 ** 
## hour14             -1.546614   0.374837  -4.126 3.69e-05 ***
## hour15             -0.726932   0.367113  -1.980 0.047689 *  
## hour16             -0.494561   0.348061  -1.421 0.155345    
## hour17              0.970022   0.340665   2.847 0.004407 ** 
## hour18              2.868751   0.338530   8.474  < 2e-16 ***
## hour19              1.279611   0.323058   3.961 7.47e-05 ***
## hour20              1.144665   0.305692   3.745 0.000181 ***
## hour21              1.080319   0.303612   3.558 0.000373 ***
## hour22              1.123475   0.292356   3.843 0.000122 ***
## hour23              0.661801   0.278816   2.374 0.017615 *  
## humidity           -0.087563   0.005371 -16.303  < 2e-16 ***
## wind_speed         -0.144830   0.056299  -2.573 0.010096 *  
## dew                 0.188817   0.014036  13.453  < 2e-16 ***
## solar               1.090947   0.151043   7.223 5.09e-13 ***
## rainfall           -1.674997   0.193273  -8.666  < 2e-16 ***
## snowfall           -0.870396   0.235197  -3.701 0.000215 ***
## holidayNo Holiday   1.225778   0.246854   4.966 6.85e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9397.0  on 6784  degrees of freedom
## Residual deviance: 2870.2  on 6743  degrees of freedom
## AIC: 2954.2
## 
## Number of Fisher Scoring iterations: 18

However, the AIC value of model 3 is higher than that of model 2, so we continue with model 2. Note also the relation between the residual deviance and its degrees of freedom (2865.8 on 6742 for model 2): if the data were truly binomially distributed, the residual deviance should be approximately equal to its degrees of freedom. The clear discrepancy here indicates that the dispersion assumption is violated, i.e. the data are overdispersed.
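
This deviance check can be made explicit by estimating the dispersion parameter directly: the sum of squared Pearson residuals divided by the residual degrees of freedom should be close to 1 for a well-specified binomial model. A minimal sketch on simulated data (the variables are illustrative, not the project's train set):

```r
# Estimate the dispersion of a binomial GLM from its Pearson residuals.
# Values far from 1 suggest the binomial variance assumption is violated.
set.seed(1)
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(0.5 + x))

fit <- glm(y ~ x, family = binomial)

# Pearson chi-square statistic over residual degrees of freedom
phi <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
phi
```

Applied to glm_bike2, the same two lines would quantify how far the fitted model departs from the binomial dispersion assumption.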

6.1.1 Overdispersion

Since the data appear overdispersed, we refit the model with the quasibinomial family, which estimates the dispersion parameter instead of fixing it at 1.

glm_bike_q1 <- glm(highdemand ~ month + hour + temperature + humidity + wind_speed + visibility 
              + dew + solar + rainfall + snowfall + holiday, 
              family = "quasibinomial", data = train)
drop1(glm_bike_q1, test="F")
## Single term deletions
## 
## Model:
## highdemand ~ month + hour + temperature + humidity + wind_speed + 
##     visibility + dew + solar + rainfall + snowfall + holiday
##             Df Deviance  F value    Pr(>F)    
## <none>           2865.5                       
## month       11   3513.1 138.5039 < 2.2e-16 ***
## hour        23   4672.4 184.8110 < 2.2e-16 ***
## temperature  1   2870.1  10.7735 0.0010349 ** 
## humidity     1   2893.5  65.8492 5.739e-16 ***
## wind_speed   1   2871.9  14.9870 0.0001093 ***
## visibility   1   2865.8   0.7211 0.3958098    
## dew          1   2869.4   9.2837 0.0023209 ** 
## solar        1   2913.4 112.7553 < 2.2e-16 ***
## rainfall     1   3078.4 500.9053 < 2.2e-16 ***
## snowfall     1   2889.3  55.9760 8.269e-14 ***
## holiday      1   2889.1  55.4114 1.099e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Visibility is again not significant in the model, so it will be removed in the next one.

glm_bike_q2 <- update(glm_bike_q1, . ~ . -visibility)
drop1(glm_bike_q2, test="F")
## Single term deletions
## 
## Model:
## highdemand ~ month + hour + temperature + humidity + wind_speed + 
##     dew + solar + rainfall + snowfall + holiday
##             Df Deviance  F value    Pr(>F)    
## <none>           2865.8                       
## month       11   3518.5 139.5950 < 2.2e-16 ***
## hour        23   4674.5 185.0041 < 2.2e-16 ***
## temperature  1   2870.2  10.3745 0.0012837 ** 
## humidity     1   2895.7  70.2623 < 2.2e-16 ***
## wind_speed   1   2872.0  14.6202 0.0001327 ***
## dew          1   2869.7   9.1196 0.0025383 ** 
## solar        1   2913.4 111.9571 < 2.2e-16 ***
## rainfall     1   3083.3 511.7643 < 2.2e-16 ***
## snowfall     1   2889.9  56.7016 5.734e-14 ***
## holiday      1   2889.5  55.6375 9.808e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since a quasibinomial model does not provide an AIC value, we will evaluate it in the next section by comparing classification accuracy.

6.2 Model Evaluation

fitted(glm_bike2) %>% round(digits =2)
fitted(glm_bike_q2) %>% round(digits =2)
fitted.bikes.disc <- ifelse(fitted(glm_bike2) < 0.5,
                            yes=0, no=1)
b.obs.fit <- data.frame(obs = train$highdemand, fitted = fitted.bikes.disc)
table(obs = b.obs.fit$obs,
      fit = b.obs.fit$fitted)
##    fit
## obs    0    1
##   0 2955  314
##   1  279 3237

2,955 observations were correctly labeled as low-demand and 3,237 as high-demand. Meanwhile, 314 low-demand observations were misclassified as high-demand, and 279 high-demand observations were misclassified as low-demand.
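
The training accuracy implied by this table can be computed directly from the counts above (diagonal of the table over all observations):

```r
# Rebuild the obs-by-fitted table from the printed counts above
tab <- matrix(c(2955,  314,
                 279, 3237),
              nrow = 2, byrow = TRUE,
              dimnames = list(obs = c("0", "1"), fit = c("0", "1")))

# Correct classifications sit on the diagonal
acc <- sum(diag(tab)) / sum(tab)
round(acc, 3)   # 0.913
```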

fitted.bikes.disc_q <- ifelse(fitted(glm_bike_q2) < 0.5,
                            yes=0, no=1)
b.obs.fit_q <- data.frame(obs = train$highdemand, fitted = fitted.bikes.disc_q)
table(obs = b.obs.fit_q$obs,
      fit = b.obs.fit_q$fitted)
##    fit
## obs    0    1
##   0 2955  314
##   1  279 3237

Both models produce identical fitted classifications, so we carry only one of them forward.

6.3 Model Prediction

pred.glmb <- ifelse(predict(glm_bike2, type = "response", newdata = test) < 0.5,
                            yes=0, no=1)
matrix.glmb <- confusionMatrix(as.factor(test$highdemand), as.factor(pred.glmb))
matrix.glmb
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 736  61
##          1  76 807
##                                           
##                Accuracy : 0.9185          
##                  95% CI : (0.9043, 0.9311)
##     No Information Rate : 0.5167          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8366          
##                                           
##  Mcnemar's Test P-Value : 0.2317          
##                                           
##             Sensitivity : 0.9064          
##             Specificity : 0.9297          
##          Pos Pred Value : 0.9235          
##          Neg Pred Value : 0.9139          
##              Prevalence : 0.4833          
##          Detection Rate : 0.4381          
##    Detection Prevalence : 0.4744          
##       Balanced Accuracy : 0.9181          
##                                           
##        'Positive' Class : 0               
## 
acc.glmb <- matrix.glmb$overall['Accuracy']

Conclusion: Even though the model reaches about 92% accuracy, the overdispersion in the data still needs to be taken into account. There may be a better way to fit a binomial-family model, or another model may suit this dataset better.
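
As a side note, the manual backward elimination performed above (dropping visibility, then testing temperature) could be automated with base R's step(), which repeatedly removes the term whose deletion lowers the AIC most. A hedged sketch on simulated data; the predictors are illustrative, not the project's:

```r
set.seed(42)
n  <- 400
df <- data.frame(x1 = rnorm(n),   # informative
                 x2 = rnorm(n),   # informative
                 x3 = rnorm(n))   # pure noise, a candidate for removal
df$y <- rbinom(n, size = 1, prob = plogis(1.5 * df$x1 - df$x2))

full    <- glm(y ~ x1 + x2 + x3, family = binomial, data = df)
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)   # the AIC-selected model
```

step() never returns a model with a higher AIC than its starting point, so step(glm_bike1, direction = "backward") would perform the same kind of selection automatically.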

7 Support Vector Machine

7.1 Train the Model

From the graphical analysis section, the high-demand group shows differences across all variables, so we include all of them in the SVM model. First, we subset the training and test data and build the model.

library(e1071)
svm <- train[, c("highdemand", "month", "wday", "hour", "temperature", "humidity", "wind_speed", 
                      "visibility", "dew", "solar", "rainfall", "snowfall", "holiday")]

svm.test <- test[, c("month", "wday", "hour", "temperature", "humidity", "wind_speed", 
                      "visibility", "dew", "solar", "rainfall", "snowfall", "holiday")]

svm.test_truth <- test[, c("highdemand", "month", "wday", "hour", "temperature", "humidity", 
                           "wind_speed", "visibility", "dew", "solar", "rainfall", "snowfall",
                           "holiday")] %>% pull (highdemand)
  
bikes_svm <- svm(highdemand ~ ., svm, kernel = "linear", scale = TRUE, cost =10)
plot(bikes_svm, svm, temperature ~ dew)
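
The model above fixes cost = 10. e1071 also provides tune(), which grid-searches hyperparameters by cross-validation (10-fold by default). An illustrative sketch on the built-in iris data rather than the project's train set:

```r
library(e1071)

# Cross-validated grid search over the cost parameter
set.seed(7)
tuned <- tune(svm, Species ~ ., data = iris,
              kernel = "linear",
              ranges = list(cost = c(0.1, 1, 10)))

tuned$best.parameters   # cost value with the lowest CV error
best_model <- tuned$best.model
```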

7.2 Make Predictions

pred.svm <- predict(bikes_svm, svm.test)
matrix.svm <- confusionMatrix(svm.test_truth, pred.svm)
matrix.svm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 739  58
##          1  70 813
##                                          
##                Accuracy : 0.9238         
##                  95% CI : (0.9101, 0.936)
##     No Information Rate : 0.5185         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.8473         
##                                          
##  Mcnemar's Test P-Value : 0.3309         
##                                          
##             Sensitivity : 0.9135         
##             Specificity : 0.9334         
##          Pos Pred Value : 0.9272         
##          Neg Pred Value : 0.9207         
##              Prevalence : 0.4815         
##          Detection Rate : 0.4399         
##    Detection Prevalence : 0.4744         
##       Balanced Accuracy : 0.9234         
##                                          
##        'Positive' Class : 0              
## 
acc.svm <- matrix.svm$overall['Accuracy']

Conclusion: On the test data, the SVM model correctly labeled 739 observations as low-demand and 813 as high-demand. Meanwhile, 58 low-demand observations were misclassified as high-demand, and 70 high-demand observations were misclassified as low-demand. Overall, the model reaches about 92% accuracy.

8 Neural Network

8.1 Data Understanding

Read necessary libraries and choose necessary variables from the train and test data sets for the model.

library(nnet)
library(gamlss.add)
bikes_nn <- train[, c("highdemand", "month", "wday", "hour", "temperature", "humidity", 
              "wind_speed", "visibility", "dew", "solar", "rainfall", "snowfall", "holiday")]

bikes_nn.test <- test[, c("highdemand", "month", "wday", "hour", "temperature", "humidity", 
               "wind_speed", "visibility", "dew", "solar", "rainfall", "snowfall", "holiday")]
str(bikes_nn)
## 'data.frame':    6785 obs. of  13 variables:
##  $ highdemand : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ month      : Factor w/ 12 levels "1","2","3","4",..: 12 12 12 12 12 12 12 12 12 12 ...
##  $ wday       : Factor w/ 7 levels "1","2","3","4",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ hour       : Factor w/ 24 levels "0","1","2","3",..: 1 2 3 4 5 6 7 8 10 11 ...
##  $ temperature: num  -5.2 -5.5 -6 -6.2 -6 -6.4 -6.6 -7.4 -6.5 -3.5 ...
##  $ humidity   : int  37 38 39 40 36 37 35 38 27 24 ...
##  $ wind_speed : num  2.2 0.8 1 0.9 2.3 1.5 1.3 0.9 0.5 1.2 ...
##  $ visibility : int  2000 2000 2000 2000 2000 2000 2000 2000 1928 1996 ...
##  $ dew        : num  -17.6 -17.6 -17.7 -17.6 -18.6 -18.7 -19.5 -19.3 -22.4 -21.2 ...
##  $ solar      : num  0 0 0 0 0 0 0 0 0.23 0.65 ...
##  $ rainfall   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ snowfall   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ holiday    : Factor w/ 2 levels "Holiday","No Holiday": 2 2 2 2 2 2 2 2 2 2 ...
bikes %>%
  ggplot(aes (x=temperature, fill = highdemand)) +
  geom_histogram(position="stack")

The histogram of temperature by demand level shows that most high-demand hours occur at higher temperatures, while low-demand hours are concentrated at lower temperatures.

8.2 Build the Network

demand_net <- nnet(highdemand ~ ., data = bikes_nn, size = 10, maxit=100, 
                   range=0.1, decay=5e-4, MaxNWts=65123)
## # weights:  511
## initial  value 4622.897089 
## iter  10 value 3506.628868
## iter  20 value 2827.514316
## iter  30 value 1978.542868
## iter  40 value 1911.970149
## iter  50 value 1711.506443
## iter  60 value 1524.615852
## iter  70 value 1446.100993
## iter  80 value 1379.359325
## iter  90 value 1304.954104
## iter 100 value 1265.941197
## final  value 1265.941197 
## stopped after 100 iterations
plot(demand_net)

demand_net
## a 49-10-1 network with 511 weights
## inputs: month2 month3 month4 month5 month6 month7 month8 month9 month10 month11 month12 wday2 wday3 wday4 wday5 wday6 wday7 hour1 hour2 hour3 hour4 hour5 hour6 hour7 hour8 hour9 hour10 hour11 hour12 hour13 hour14 hour15 hour16 hour17 hour18 hour19 hour20 hour21 hour22 hour23 temperature humidity wind_speed visibility dew solar rainfall snowfall holidayNo Holiday 
## output(s): highdemand 
## options were - entropy fitting  decay=5e-04
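
The 511 weights reported here can be verified by hand: each of the 10 hidden units receives the 49 dummy-coded inputs plus a bias, and the single output unit receives the 10 hidden activations plus a bias.

```r
n_inputs <- 49   # dummy-coded predictors listed above
n_hidden <- 10   # size = 10
n_output <- 1    # binary highdemand

n_weights <- (n_inputs + 1) * n_hidden + (n_hidden + 1) * n_output
n_weights   # 511
```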

8.3 Make Predictions

pred.nn <- predict(demand_net, bikes_nn.test, type = "class")
b_nn <- table(pred=pred.nn, true=bikes_nn.test$highdemand)
b_nn
##     true
## pred   0   1
##    0 735  63
##    1  62 820
matrix.nn <- confusionMatrix(as.factor(bikes_nn.test$highdemand), as.factor(pred.nn))
matrix.nn
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 735  62
##          1  63 820
##                                          
##                Accuracy : 0.9256         
##                  95% CI : (0.912, 0.9377)
##     No Information Rate : 0.525          
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.8508         
##                                          
##  Mcnemar's Test P-Value : 1              
##                                          
##             Sensitivity : 0.9211         
##             Specificity : 0.9297         
##          Pos Pred Value : 0.9222         
##          Neg Pred Value : 0.9287         
##              Prevalence : 0.4750         
##          Detection Rate : 0.4375         
##    Detection Prevalence : 0.4744         
##       Balanced Accuracy : 0.9254         
##                                          
##        'Positive' Class : 0              
## 
acc.nn <- matrix.nn$overall['Accuracy']

The accuracy check shows that this model achieves about 0.93 accuracy.

ROC Curve

library(ROCR)
pred_raw <- predict(demand_net, bikes_nn.test, decision.values=TRUE, type="raw")
pred <- ROCR::prediction(pred_raw, bikes_nn.test$highdemand)
perf <- ROCR::performance(pred, "tpr", "fpr")
plot(perf, lwd=2, col="blue")
abline(a=0, b=1)
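
Beyond the visual curve, ROCR can report the area under the ROC curve (AUC) as a single summary number. A self-contained sketch with simulated scores; in the project, pred_raw and bikes_nn.test$highdemand would take their place:

```r
library(ROCR)

# Illustrative scores correlated with binary labels
set.seed(3)
labels <- rbinom(200, size = 1, prob = 0.5)
scores <- labels + rnorm(200, sd = 0.8)

pred_obj <- ROCR::prediction(scores, labels)
auc <- ROCR::performance(pred_obj, "auc")@y.values[[1]]
auc   # 0.5 = chance level, 1 = perfect separation
```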

Conclusion: On the test data, the neural network correctly labeled 735 observations as low-demand and 820 as high-demand. Meanwhile, 62 low-demand observations were misclassified as high-demand, and 63 high-demand observations were misclassified as low-demand. Overall, the model reaches about 93% accuracy.

9 Cross Validation & Conclusion

To split the data into training and testing sets, we used an 80/20 holdout split (80% of the data to train the models and 20% to test them). We aim to find the model that maximizes the predictive performance for bike-rental demand.
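
An 80/20 holdout split of this kind takes a few lines of base R; the data frame below is a stand-in for the project's processed data:

```r
set.seed(2021)
dat <- data.frame(x = rnorm(100), y = rnorm(100))   # stand-in data

# Draw 80% of the row indices for training; the remainder is the test set
train_idx <- sample(nrow(dat), size = 0.8 * nrow(dat))
train_set <- dat[train_idx, ]
test_set  <- dat[-train_idx, ]

c(train = nrow(train_set), test = nrow(test_set))   # 80 and 20 rows
```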

The count models - Linear Model, Generalized Additive Model (GAM), and the GLMs with Poisson, Quasipoisson, and Negative Binomial - predict the bike_count variable. We used R-squared as the measure of fit for three reasons: (1) it is easy to compute, (2) it is comparatively insensitive to extreme outliers, and (3) it is symmetric. The last property matters in our case because both over- and underestimating demand for bike rentals would damage the customer experience of the service.

The R-squared values are provided in the table below.

R.df <- data.frame(Linear_Model = round(R.lm,3), GAM = round(R.gam,3), 
                   GLM_Poisson = round(R.glm3, 3), GLM_Quasipoisson = round(R.glmq, 3), 
                   GLM_Negative_Binomial = round(R.glmnb, 3))
R.df

Three other models - GLM with Binomial, Support Vector Machine, and Neural Network - predict the binary highdemand variable. As the measure of predictive performance for these models, we used the accuracy from the confusion matrix (the proportion of predictions that match the actual values).

acc.df <- data.frame(GLM_Binomial = round(acc.glmb,3), 
                     SVM = round(acc.svm,3), Neural_Network = round(acc.nn, 3))
acc.df

Neural Network has the highest accuracy (0.93). Thus, this model is recommended to forecast whether the level of demand is low or high.

9.1 Conclusion

Predict Bike Counts

The prediction of bike counts could help the business understand how many bikes to prepare for user needs and how many maintenance workers to deploy on a given day under certain weather and other conditions. When the numbers of bikes and workers match the number of bikes actually used that day, the company can offer its users a quality customer experience - a smooth and convenient ride.

The R-squared values show that the GLM with Quasipoisson has the highest R-squared (79%). However, all variables in the Quasipoisson model are insignificant (see modeling section 5.3), so that model is not applicable. A better choice for the business is the GLM with Negative Binomial, which explains 72.6% of the variation in the data.

Predict If-Highdemand

The assumption here is that if the hourly rental bike count exceeds 500, the hour counts as a high-demand situation; otherwise, it counts as a low-demand situation.
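
In R this dichotomization is a one-liner; the bike_count values below are illustrative:

```r
bike_count <- c(120, 830, 502, 499, 1250)   # illustrative hourly rental counts

# 1 = high demand (more than 500 rentals in the hour), 0 = low demand
highdemand <- factor(ifelse(bike_count > 500, yes = 1, no = 0))
highdemand   # 0 1 1 0 1
```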

The prediction of high or low demand could help the business allocate its resources better, from the number of workers to the number of bikes that need repair. In times of low demand, for example, the company could assign more workers to bike maintenance and repair in the workshop and leave only a few workers outside monitoring the bike-sharing stations. In times of high demand, instead, the company needs more bikes in service and more workers outside monitoring the stations.

The accuracy values obtained from the three classification models - GLM with Binomial, Support Vector Machine, and Neural Network - show that the Neural Network is the best model for the company to predict high or low demand, with about 92.6% accuracy.

Modeling - a Never-Ending Story

We split the original data set into training and test sets, so the accuracy of the models was assessed on held-out historical data. However, business operations, market competition, and customer behavior change every day. Moreover, our predictions are based on historical weather data; future predictions will rely on weather forecasts, so their accuracy depends on forecast quality. We therefore advise our client to make weekly rather than monthly forecasts.

Furthermore, the models included in this project do not cover every modeling method in the field. The results should be seen as a reference rather than a definitive solution. Modeling, in any case, is a never-ending story - newer data and a deeper understanding of the various predictors would certainly improve and change the models.

10 ABM & ABC

10.1 Agent-Based Model (ABM)

10.1.1 Installing packages for NetLogoR

Since the NetLogoR package had been removed from the CRAN repository, we installed it together with roughly 20 required packages from source.

#install.packages("ellipsis")
#install.packages("xfun")
#install.packages("data.table", dependencies=TRUE)
#install.packages("devtools")
#require(devtools)

#packageurl_1 <- "https://cran.r-project.org/src/contrib/fastmatch_1.1-0.tar.gz"
#packageurl_2 <-"https://stat.ethz.ch/CRAN//src/contrib/DBI_1.1.1.tar.gz"
#packageurl_3 <-"https://cran.r-project.org/src/contrib/bit_4.0.4.tar.gz"
#packageurl_4 <-"https://cran.r-project.org/src/contrib/blob_1.2.1.tar.gz"
#packageurl_5 <-"https://cran.r-project.org/src/contrib/plogr_0.2.0.tar.gz"
#packageurl_6 <- "https://cran.microsoft.com/snapshot/2017-08-06/src/contrib/bit64_0.9-7.tar.gz"
#packageurl_7 <-"https://cran.r-project.org/src/contrib/RSQLite_2.2.7.tar.gz"
#packageurl_8 <- "https://cran.r-project.org/src/contrib/Archive/Require/Require_0.0.12.tar.gz"
#packageurl_9 <- "https://cran.r-project.org/src/contrib/fpCompare_0.2.3.tar.gz"
#packageurl_10 <- "https://cran.r-project.org/src/contrib/sp_1.4-5.tar.gz"
#packageurl_11 <- "https://cran.r-project.org/src/contrib/raster_3.4-10.tar.gz"
#packageurl_12 <- "https://cran.r-project.org/src/contrib/reproducible_1.2.7.tar.gz"
#packageurl_13 <- "https://cran.r-project.org/src/contrib/htmlTable_2.2.1.tar.gz"
#packageurl_14 <- "https://cran.r-project.org/src/contrib/viridis_0.6.1.tar.gz"
#packageurl_15 <- "https://cran.r-project.org/src/contrib/Hmisc_4.5-0.tar.gz"
#packageurl_16 <- "https://cran.r-project.org/src/contrib/Archive/SpaDES.tools/SpaDES.tools_0.3.6.tar.gz"
#packageurl_17 <- "https://cran.r-project.org/src/contrib/Archive/NetLogoR/NetLogoR_0.3.7.tar.gz"

#install.packages(packageurl_1, repos=NULL, type="source")
#install.packages(packageurl_2, repos=NULL, type="source")
#install.packages(packageurl_3, repos=NULL, type="source")
#install.packages(packageurl_4, repos=NULL, type="source")
#install.packages(packageurl_5, repos=NULL, type="source")
#install.packages(packageurl_6, repos=NULL, type="source")
#install.packages(packageurl_7, repos=NULL, type="source")
#install.packages(packageurl_8, repos=NULL, type="source")
#install.packages(packageurl_9, repos=NULL, type="source")
#install.packages(packageurl_10, repos=NULL, type="source")
#install.packages(packageurl_11, repos=NULL, type="source")
#install.packages(packageurl_12, repos=NULL, type="source")
#install.packages(packageurl_13, repos=NULL, type="source")
#install.packages(packageurl_14, repos=NULL, type="source")
#install.packages(packageurl_15, repos=NULL, type="source")
#install.packages(packageurl_16, repos=NULL, type="source")
#install.packages(packageurl_17, repos=NULL, type="source")

Load the required libraries.

#library(NetLogoR)
library(stringr)
library(minpack.lm)

10.1.2 Running an ABM simulation for spreading of news/diseases.

The simulation was run in a separate R script included in the project zip file (ABM_simulation.R). The code is commented out here because it takes a long time to complete. During the simulation runs, the live plotting, which slows down the computation, was disabled.

Simulate all combinations of pubs [1-20] and agents [40-100] with nested for loops.

#for (number_agents in 40:100) {
    #for (number_pubs in 1:20) {
        ### 1. DEFINE THE SPACE AND AGENTS ###
        
        ## simulations parameters
        
        #simtime<-200               ## duration time of the simulation
        #gridSize_x<-15             ## number of patches in the grid
        #gridSize_y<-15
        #displacement_normal<-0.1   ## speed of moving agents 
        #displacement_pub<-0.01     ## speed in the pub
        #plot_data_out<-numeric()
        
        ## world set up
        
        #w1 <- createWorld(minPxcor = 0, maxPxcor = gridSize_x-1, minPycor = 0, 
        #                  maxPycor = gridSize_y-1)
        #x_pub <- randomPxcor(w1,number_pubs)
        #y_pub <- randomPycor(w1,number_pubs)
        #w1 <- NLset(world = w1, agents = patches(w1), val = 0) 
        #w1 <- NLset(world = w1, agents = patch(w1, x_pub, y_pub), val = 1) 
        
        ## agents set up
        
        #t1 <- createTurtles(n = number_agents, coords = randomXYcor(w1, n = number_agents), 
        #                    breed="S", color="black") 
        #t1 <- NLset(turtles = t1, agents = turtle(t1, who = 0), var = "breed", val = "I")
        #t1 <- NLset(turtles = t1, agents = turtle(t1, who = 0), var = "color", val = "red") 
        #t1 <- turtlesOwn(turtles = t1, tVar = "displacement", tVal = displacement_normal)
        
        #plot(w1, axes = 0, legend = FALSE, par(bty = 'n'))
        #points(t1, col = of(agents = t1, var = "color"), pch = 20)
        
        ### 2. THE SIMULATION TIME LOOP ###
        
        #for (time in 1:simtime) { 
            
        #    t1 <- fd(turtles = t1, dist=t1$displacement, world = w1, torus = TRUE, out = FALSE) 
        #    t1 <- right(turtles = t1, angle = sample(-20:20, 1, replace = F)) 
        #    plot(w1, axes = 0, legend = FALSE, par(bty = 'n'))
        #    points(t1, col = of(agents = t1, var = "color"), pch = 20)
        #    meet <- turtlesOn(world = w1, turtles = t1, agents = t1[of(agents = t1,
        #                var = "breed")=="I"]) 
        #    t1 <- NLset(turtles = t1, agents = meet, var = "breed", val = "I") 
        #    t1 <- NLset(turtles = t1, agents = meet, var = "color", val = "red")
            
            ## agents that enter a pub spend more time there
            
        #    pub <- turtlesOn(world = w1, turtles = t1, agents = patch(w1, x_pub, y_pub))
        #    t1 <- NLset(turtles = t1, agents = turtle(t1, who = pub$who), 
        #                var = "displacement", val = displacement_pub)      ## if enters the pub
        #    t1 <- NLset(turtles = t1, agents = turtle(t1, who = t1[-(pub$who+1)]$who), 
        #                var = "displacement", val = displacement_normal)   ## if exits the pub
        #    Sys.sleep(0.1)
            
            ## store time-course data for plotting in the end
            
        #    contaminated_counter <- sum(str_count(t1$color, "red"))
        #    tmp_data <- c(time,contaminated_counter)
        #    plot_data_out <- rbind(plot_data_out, tmp_data)
            
        #}
        
        ### 3. PLOTTING AND FITTING SIMULATED DATA ###
        
        ## perform non-linear curve fitting of the data 
       
        #df <- as.data.frame(plot_data_out)
        #names(df) <- c("time","contaminated_counter")
        #x <- df$time
        #y <- df$contaminated_counter
        
        ## give arbitrary initial guesses of 3, 4, 600, 1000 and fit a 4-parameter logistic equation 
        
        #model <- nlsLM(y ~ d + (a-d) / (1 + (x/c)^b), start = list(a = 3, b = 4, c = 600, d = 1000)) 
        
        ## make a line with the fitting model that goes through the data
        
        #fit_x <- data.frame(x = seq(min(x),max(x),len = simtime))
        #fit_y <- predict(model, newdata = fit_x)
        #fit_df <- as.data.frame(cbind(fit_x,fit_y))
        #names(fit_df) <- c("x","y")
        #fitted_function <- data.frame(x = seq(min(x),max(x),len = simtime))
        #lines(fitted_function$x, predict(model, newdata = fitted_function))
        
        ## store summary statistics in a vector to be appended after each iteration to the output file
        
        #simulation_run_name <- paste0("sim_",number_agents,"_",number_pubs)
        #varied_params <- c(number_agents,number_pubs)
        #summary_stat <- c(simulation_run_name, varied_params, as.vector(model$m$getPars()) )
        
        ## save summary statistics of all the performed simulations in file with:
        ## Simulation ID, parameters for simulation, outcome of the curve
       
        #write.table(as.data.frame(t(summary_stat)), "./summary_stat.csv", sep = ",", col.names = FALSE, 
        #                            row.names=FALSE, append = TRUE) 
    #}
    
#}

As a result, a dataset of 1,220 (20 × 61) simulations was built and exported as “summary_stat.csv”.
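The post-processing step inside the loop above reduces each simulated time course to the four parameters of a logistic curve. A minimal, self-contained sketch of that fit on synthetic data, using base R's `nls()` in place of `minpack.lm::nlsLM()` (all values below are illustrative, not taken from the simulations):

```r
## Four-parameter logistic: d + (a - d) / (1 + (x / c)^b)
## Synthetic "contaminated counter" time course (illustrative values only)
set.seed(42)
x <- 1:1000
y_true <- 1000 + (3 - 1000) / (1 + (x / 600)^4)
y <- y_true + rnorm(length(x), sd = 10)

## Fit with base R's nls(); the chunk above uses minpack.lm::nlsLM(),
## which is more robust to poor starting values
model <- nls(y ~ d + (a - d) / (1 + (x / c)^b),
             start = list(a = 3, b = 4, c = 600, d = 1000))

## The fitted parameters should roughly recover a = 3, b = 4, c = 600, d = 1000
coef(model)
```

The vector `coef(model)` is what the loop stores as `as.vector(model$m$getPars())`, one row of summary statistics per simulation.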

10.2 Approximate Bayesian Computation (ABC)

10.2.1 Loading libraries and importing the observed data and the ABM simulation data

library(abc)

obs_data <- c(3, 2, 1143, 1655)
sim_param <- read.table(file="sim_param.dat",header=FALSE, sep = ",")
sim_data <- read.table(file="sim_data.dat",header=FALSE, sep = ",")

10.2.2 Running the ABC model

To compare the simulated dataset with the “observed” summary statistics 3, 2, 1143, 1655, we ran an ABC analysis with a “log” transformation applied to the parameter values and the “neuralnet” regression method.

result <- abc(target=obs_data,
           param=sim_param,
           sumstat=sim_data,
           tol=0.005,
           transf=c("log"),
           method="neuralnet")

10.2.3 Outcomes of the ABC model

Below are the graphical diagnostic report generated by the ABC model and the table of adjusted values drawn from the inferred posterior distribution.

plot(result, param=sim_param)

result$adj.values
##            V1       V2
## [1,] 51.08730 9.155542
## [2,] 51.06800 9.155598
## [3,] 51.08820 9.306829
## [4,] 52.12616 1.111161
## [5,] 51.06791 9.329076
## [6,] 51.08854 9.157650
## [7,] 51.06988 9.285002
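The matrix of adjusted draws can be condensed into point estimates and credible intervals. A minimal sketch on a stand-in matrix shaped like `result$adj.values` (the numbers are copied from the illustration above; in practice `summary(result)` produces an equivalent report directly):

```r
## Stand-in for result$adj.values: posterior draws, one column per
## inferred parameter (values copied from the table shown above)
draws <- cbind(V1 = c(51.09, 51.07, 51.09, 52.13, 51.07, 51.09, 51.07),
               V2 = c(9.16, 9.16, 9.31, 1.11, 9.33, 9.16, 9.29))

## Posterior means and 95% credible intervals, column by column
post_mean <- colMeans(draws)
post_ci   <- apply(draws, 2, quantile, probs = c(0.025, 0.975))

post_mean
post_ci
```

With only a handful of accepted simulations (tol = 0.005 of 1,220), these summaries are coarse; loosening the tolerance or running more simulations would tighten them.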